Efficient Sphere-Covering and Converse Measure Concentration 

Via Generalized Source Coding Theorems

I. Kontoyiannis 

February 1, 2008

Abstract






Suppose $A$ is a finite set, let $P$ be a discrete probability distribution on $A$, and let $M$ be
an arbitrary "mass" function on $A$. We give a precise characterization of the most efficient
way in which $A^n$ can be almost-covered using spheres of a fixed radius. An almost-covering
is a subset $C_n$ of $A^n$, such that the union of the spheres centered at the points of $C_n$ has
probability close to one with respect to the product distribution $P^n$. Spheres are defined
in terms of a single-letter distortion measure on $A^n$, an efficient covering is one with small
mass $M^n(C_n)$, and $n$ is typically large. In information-theoretic terms, the sets $C_n$ are
rate-distortion codebooks, but instead of minimizing their size we seek to minimize their
mass. With different choices for $M$ and the distortion measure on $A$ our results give various
corollaries as special cases, including Shannon's classical rate-distortion theorem, a version
of Stein's lemma (in hypothesis testing), and a new converse to some measure-concentration
inequalities on discrete spaces. Under mild conditions, we generalize our results to abstract
spaces and non-product measures.

1 Introduction

Suppose $A$ is a finite set and let $P$ be a discrete probability mass function on $A$ (more general
probability spaces are considered later). Assume that the distortion (or distance) $\rho(x, y)$ between
two symbols (or points) $x$ and $y$ from $A$ is measured by a fixed $\rho : A \times A \to [0, \infty)$, and for each
$n \geq 1$ define a single-letter distortion measure (or coordinate-wise distance function) $\rho_n$ by

$$\rho_n(x_1^n, y_1^n) = \frac{1}{n} \sum_{i=1}^{n} \rho(x_i, y_i), \tag{1}$$

for $x_1^n = (x_1, x_2, \ldots, x_n)$ and $y_1^n = (y_1, y_2, \ldots, y_n)$ in $A^n$.
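As a concrete illustration of the definition in (1), the following is a minimal Python sketch. The function names `rho_n` and `hamming` are hypothetical (they do not appear in the paper), and Hamming distortion is used only as an example of a per-letter distortion $\rho$.

```python
from typing import Callable, Hashable, Sequence


def rho_n(x: Sequence[Hashable], y: Sequence[Hashable],
          rho: Callable[[Hashable, Hashable], float]) -> float:
    """Single-letter distortion of (1): the average of rho over coordinates."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    return sum(rho(a, b) for a, b in zip(x, y)) / len(x)


def hamming(a: Hashable, b: Hashable) -> float:
    """Example per-letter distortion: 0 if the symbols agree, 1 otherwise."""
    return 0.0 if a == b else 1.0


# Two binary strings of length 4 that differ in two coordinates.
print(rho_n((0, 1, 1, 0), (0, 0, 1, 1), hamming))  # 0.5
```

Any other nonnegative function of two symbols can be passed in place of `hamming`.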

Given a $D \geq 0$, we want to "almost" cover the product space $A^n$ using a finite number of
balls (or "spheres") $B(y_1^n, D)$, where

$$B(y_1^n, D) = \{x_1^n \in A^n : \rho_n(x_1^n, y_1^n) \leq D\} \tag{2}$$

is the (closed) ball of distortion-radius $D$ centered at $y_1^n \in A^n$. For our purposes, an "almost
covering" is a subset $C \subset A^n$, such that the union of the balls of radius $D$ centered at the points
of $C$ has large $P^n$-probability, that is,

$$P^n([C]_D) \text{ is close to } 1, \tag{3}$$

where $[C]_D$ is the $D$-blowup of $C$,

$$[C]_D = \bigcup_{y_1^n \in C} B(y_1^n, D) = \{x_1^n : \rho_n(x_1^n, y_1^n) \leq D \text{ for some } y_1^n \in C\}.$$

More specifically, given a "mass function" $M : A \to (0, \infty)$, we are interested in covering $A^n$
efficiently, namely, finding sets $C$ that satisfy (3) and also have small mass

$$M^n(C) = \sum_{y_1^n \in C} M^n(y_1^n) = \sum_{y_1^n \in C} \prod_{i=1}^{n} M(y_i).$$
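For small $n$, both quantities just defined, the probability of the $D$-blowup and the mass of a set, can be computed by brute-force enumeration. The sketch below assumes Hamming distortion and a finite alphabet given as the keys of a dictionary; the helper names (`blowup_probability`, `mass`) are illustrative and not taken from the paper.

```python
from itertools import product
from math import prod


def hamming_rho_n(x, y):
    """Normalized Hamming distortion between two equal-length tuples."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def blowup_probability(C, D, P, n):
    """P^n([C]_D): product-probability of all strings within distortion D of C."""
    total = 0.0
    for x in product(sorted(P), repeat=n):
        # Small tolerance so that distortions equal to D are counted.
        if any(hamming_rho_n(x, y) <= D + 1e-12 for y in C):
            total += prod(P[a] for a in x)
    return total


def mass(C, M):
    """M^n(C): sum over y in C of the product of the letter masses M(y_i)."""
    return sum(prod(M[b] for b in y) for y in C)


P = {0: 0.6, 1: 0.4}
C = [(0, 0, 0, 0), (1, 1, 1, 1)]
print(blowup_probability(C, 0.25, P, 4))   # probability covered by the two balls
print(mass(C, P))                          # mass with M = P
print(mass(C, {0: 1.0, 1: 1.0}))           # mass with M = 1 is the size of C
```

The enumeration is exponential in $n$, so this is only meant for building intuition on toy examples.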

One way to state our main question of interest is as follows:

(*) If the sets $\{C_n ; n \geq 1\}$ asymptotically $D$-cover $A^n$, that is,
$$P^n([C_n]_D) \to 1 \quad \text{as } n \to \infty,$$
how small can their masses $M^n(C_n)$ be?

Question (*) is partly motivated by the fact that several interesting questions can be easily
restated in this form. Three such examples are presented below, and in the remainder of the
paper (*) is addressed and answered in detail. In particular, it is shown that $M^n(C_n)$ typically
grows (or decays) exponentially in $n$, and an explicit lower bound, valid for all finite $n$, is given
for the exponent $(1/n) \log M^n(C_n)$ of the mass of an arbitrary $C_n$. [Throughout the paper, 'log'
denotes the natural logarithm.] Moreover, a sequence of sets $C_n$ asymptotically achieving this
lower bound is exhibited, showing that it is best possible. The outline of the proofs follows,
to some extent, the same lines as the proof of Shannon's rate-distortion theorem [^. In
particular, the "extremal" sets $C_n$ achieving the lower bound are constructed probabilistically;
each $C_n$ consists of a collection of points $y_1^n$ generated by taking independent and identically
distributed (IID) samples from a suitable distribution on $A^n$, but (unlike Shannon) here we need
to condition on seeing typical realizations, making the individual elements of the random $C_n$
non-IID.

Example 1. (Measure Concentration on the Binary Cube) Take $A = \{0, 1\}$ so
that $A^n$ is the $n$-dimensional binary cube consisting of all binary strings of length $n$, and let
$P^n$ be a product probability distribution on $A^n$. Write $\rho_n(x_1^n, y_1^n)$ for the normalized Hamming
distortion between $x_1^n$ and $y_1^n$, so that $\rho_n(x_1^n, y_1^n)$ is the proportion of mismatches between the
two strings; formally:

$$\rho_n(x_1^n, y_1^n) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{x_i \neq y_i\}}, \qquad x_1^n, y_1^n \in A^n. \tag{4}$$

Geometrically, if $A^n$ is given the usual nearest-neighbor graph structure (two points are
connected if and only if they differ in exactly one coordinate), then $\rho_n(x_1^n, y_1^n)$ is the graph distance
between $x_1^n$ and $y_1^n$, normalized by $n$.

A well-known measure-concentration inequality for subsets $C_n$ of $A^n$ states that, for any
$D \geq 0$,

$$P^n([C_n]_D) \geq 1 - \frac{e^{-nD^2/2}}{P^n(C_n)}. \tag{5}$$

[See Proposition 2.1.1 in the comprehensive account by Talagrand |]l^], or Theorem 3.5 in the
review paper by McDiarmid [12], and the references therein.] Roughly speaking, (5) says that "if
$C_n$ is not too small, $[C_n]_D$ is almost everything." In particular, it implies that for any sequence
of sets $C_n \subset A^n$ and any $D \geq 0$,

$$\text{if } \liminf_{n \to \infty} \frac{1}{n} \log P^n(C_n) > -D^2/2, \text{ then } P^n([C_n]_D) \to 1. \tag{6}$$

A natural question to ask is whether there is a converse to the above statement: If $P^n([C_n]_D) \to 1$,
how small can the probabilities of the $C_n$ be? Taking $M = P$, this reduces to question
(*) above. In this context, (*) can be thought of as the opposite of the usual isoperimetric
problem. We are looking for sets with the "largest possible boundary"; sets $C_n$ whose
$D$-blowups (asymptotically) cover the entire space, but whose volumes $P^n(C_n)$ are as small as
possible. A precise answer for this problem is given in Corollary 3 and the discussion following
it, in the next section.
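The concentration inequality discussed in this example, $P^n([C_n]_D) \geq 1 - e^{-nD^2/2}/P^n(C_n)$, can be checked numerically on a small cube. The following Python sketch is an illustration only: the function name `concentration_check` and the choice of a Hamming ball as the test set are assumptions made here, and the enumeration is feasible only for small $n$.

```python
from itertools import product
from math import exp, prod


def concentration_check(C, D, p_one, n):
    """Return (P^n([C]_D), 1 - exp(-n D^2 / 2) / P^n(C)) on the binary cube.

    C is a collection of binary n-tuples and p_one is the Bernoulli parameter P(1).
    """
    P = {0: 1.0 - p_one, 1: p_one}

    def prob(x):
        return prod(P[a] for a in x)

    p_C = sum(prob(y) for y in C)
    p_blowup = sum(
        prob(x)
        for x in product((0, 1), repeat=n)
        # x is in the D-blowup if it is within Hamming distance n*D of some y in C
        if any(sum(a != b for a, b in zip(x, y)) <= n * D + 1e-12 for y in C)
    )
    return p_blowup, 1.0 - exp(-n * D * D / 2.0) / p_C


n = 8
# Test set: all strings with at most two ones (a Hamming ball around the zero string).
ball = [x for x in product((0, 1), repeat=n) if sum(x) <= 2]
for D in (0.25, 0.5, 1.0):
    p_blowup, lower_bound = concentration_check(ball, D, 0.4, n)
    print(D, p_blowup, lower_bound)
```

For such a short block length the lower bound is often negative and therefore trivial; it becomes informative only when $nD^2$ is large relative to $-\log P^n(C_n)$.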



Example 2. (Lossy Data Compression) Let $A$ be a finite alphabet so that $A^n$ consists
of all possible messages of length $n$ from $A$, and assume that messages are generated by a
memoryless source, with distribution $P^n$ on $A^n$. A code for these messages consists of a codebook
$C_n \subset A^n$ and an encoder $\phi_n : A^n \to C_n$. If we think of $\rho_n(x_1^n, y_1^n)$ as the distortion between
a message $x_1^n$ and its reproduction $y_1^n$, then for any given codebook $C_n$ the best choice for the
encoder is clearly the map $\phi_n$ taking each $x_1^n$ to the $y_1^n$ in $C_n$ which minimizes the distortion
$\rho_n(x_1^n, y_1^n)$. Hence, at least conceptually, finding good codes is the same as finding good
codebooks. More specifically, if $D \geq 0$ is the maximum amount of distortion we are willing to
tolerate, then a sequence of good codebooks $\{C_n\}$ is one with the following properties:

(a) The probability of encoding a message with distortion exceeding $D$ is asymptotically
negligible: $P^n([C_n]_D) \to 1$.

(b) Good compression is achieved, that is, the sizes $|C_n|$ of the codebooks are small.

What is the best achievable compression performance? That is, if the codebooks $\{C_n\}$ satisfy
(a), how small can their sizes be? Shannon's classical source coding theorem (cf. |15|y]) answers
this question. In our notation, taking $M \equiv 1$ reduces the question to a special case of (*), and
in Corollary 2 in the next section we recover Shannon's theorem as a special case of Theorems 1
and 2.

Example 3. (Hypothesis Testing) Let $A$ be a finite set and $P_1$, $P_2$ be two probability
distributions on $A$. Suppose that the null hypothesis that a sample $X_1^n = (X_1, X_2, \ldots, X_n)$ of $n$
independent observations comes from $P_1$ is to be tested against the simple alternative hypothesis
that $X_1^n$ comes from $P_2$. A test between these two hypotheses can be thought of as a decision
region $C_n \subset A^n$: If $X_1^n \in C_n$ we declare that $X_1^n \sim P_1^n$, otherwise we declare $X_1^n \sim P_2^n$. The
two probabilities of error associated with this test are

$$\alpha_n = P_1^n(C_n^c) \quad \text{and} \quad \beta_n = P_2^n(C_n). \tag{7}$$

A good test has these two probabilities vanishing as fast as possible, and we may ask, if $\alpha_n \to 0$,
how fast can $\beta_n$ decay to zero? Taking $\rho$ to be Hamming distortion, $D = 0$, $P = P_1$, and
$M = P_2$, this reduces to our original question (*). In Corollary 1 in the next section we answer
this question by deducing a version of Stein's lemma from Theorems 1 and 2. It is worth noting
that the connection between questions in hypothesis testing and information theory goes at least
as far back as Strassen's 1964 paper |l^ (see also Blahut's paper [^ in 1974, and Csiszar and
Korner's book Q for a detailed discussion).

The rest of the paper is organized as follows. In Section 2, Theorems 1 and 2 provide an
answer to question (*). In the remarks and corollaries following Theorem 2 we discuss and
interpret this answer, and we present various applications along the lines of the three examples
above. Theorem 1 is proved in Section 2 and Theorem 2 is proved in Section 3. In Section 4
we consider the same problem in a much more general setting. We let $A$ be an abstract space,
and instead of product measures $P^n$ we consider the $n$-dimensional marginals $P_n$ of a stationary
measure $\mathbb{P}$ on $A^{\mathbb{N}}$. In Theorems 3 and 4 we give analogs of Theorems 1 and 2, which hold
essentially as long as the spaces $(A^n, P_n)$ can be almost-covered by countably many $\rho_n$-balls.
Although the results of Section 2 are essentially subsumed by Theorems 3 and 4, it is possible
to give simple, elementary proofs for the special case treated in Theorems 1 and 2, so we give
separate proofs for these results first. The more general Theorems 3 and 4 are proved in Section 5,
and the Appendix contains the proofs of various technical steps needed along the way.

2 The Discrete Memoryless Case 

Let $A$ be a finite set and $P$ be a discrete probability mass function on $A$. Fix a $\rho : A \times A \to [0, \infty)$,
and for each $n \geq 1$ let $\rho_n$ be the corresponding single-letter distortion measure (or
coordinate-wise distance function) on $A^n$ defined as in (1). Also let $M : A \to (0, \infty)$ be an arbitrary positive
mass function on $A$. We assume, without loss of generality, that $P(a) > 0$ for all $a \in A$, and
also that for each $a \in A$ there exists a $b \in A$ with $\rho(a, b) = 0$ (otherwise we may consider
$\rho'(x, y) = [\rho(x, y) - \min_{z \in A} \rho(x, z)]$ instead of $\rho$). Let $\{X_n\}$ denote a sequence of IID random
variables with distribution $P$, and write $\mathbb{P} = P^{\mathbb{N}}$ for the product measure on $A^{\mathbb{N}}$ equipped with
the usual $\sigma$-algebra generated by finite-dimensional cylinders. We write $X_i^j$ for vectors of random
variables $(X_i, X_{i+1}, \ldots, X_j)$, $1 \leq i \leq j \leq \infty$, and similarly $x_i^j = (x_i, x_{i+1}, \ldots, x_j) \in A^{j-i+1}$ for
realizations of these random variables.

Next we define the rate function $R(D)$ that will provide the lower bound on the exponent of
the mass of an arbitrary $C_n \subset A^n$. For $D \geq 0$ and $Q$ a probability measure on $A$, let

$$I(P, Q, D) = \inf_{W \in \mathcal{M}(P, Q, D)} H(W \| P \times Q) \tag{8}$$

where $H(\mu \| \nu)$ denotes the relative entropy between two discrete probability mass functions $\mu$
and $\nu$ on a finite set $S$,

$$H(\mu \| \nu) = \sum_{s \in S} \mu(s) \log \frac{\mu(s)}{\nu(s)},$$

and where $\mathcal{M}(P, Q, D)$ consists of all probability measures $W$ on $A \times A$ such that $W_X$, the
first marginal of $W$, is equal to $P$, $W_Y$, the second marginal, is $Q$, and $E_W[\rho(X, Y)] \leq D$; if
$\mathcal{M}(P, Q, D)$ is empty, we let $I(P, Q, D) = \infty$. The rate function $R(D)$ is defined by

$$R(D) = R(D; P, M) = \inf_{Q} \{ I(P, Q, D) + E_Q[\log M(Y)] \}, \tag{9}$$

where the infimum is over all probability distributions $Q$ on $A$. Recalling the definition of
the mutual information between two random variables, $R(D)$ can equivalently be written in a
more information-theoretic way. If $(X, Y)$ are random variables (or random vectors) with joint
distribution $W$ and corresponding marginals $W_X$ and $W_Y$, then the mutual information between
$X$ and $Y$ is defined as

$$I(X; Y) = H(W \| W_X \times W_Y).$$

Combining the two infima in (8) and (9) we can write

$$R(D) = \inf_{(X, Y) :\, X \sim P,\ E\rho(X, Y) \leq D} \{ I(X; Y) + E[\log M(Y)] \} \tag{10}$$

where the infimum is taken over all jointly distributed random variables $(X, Y)$ such that $X$ has
distribution $P$ and $E\rho(X, Y) \leq D$. For any $x_1^n \in A^n$ and $C_n \subset A^n$, write

$$\rho_n(x_1^n, C_n) = \min_{y_1^n \in C_n} \rho_n(x_1^n, y_1^n).$$

In the following two Theorems we answer question (*) stated in the Introduction. Theorem 1
contains a lower bound (valid for all finite $n$) on the mass of an arbitrary $C_n \subset A^n$, and Theorem 2
shows that this bound is asymptotically tight. In information-theoretic terms, Theorems 1 and 2
can be thought of as generalized direct and converse coding theorems, for minimal-mass (rather
than minimal-size) codebooks.

Theorem 1. Let $C_n \subset A^n$ be arbitrary and write $D = E_{P^n}[\rho_n(X_1^n, C_n)]$. Then

$$\frac{1}{n} \log M^n(C_n) \geq R(D).$$

Theorem 2. Assume that $\rho(x, y) = 0$ if and only if $x = y$. For any $D \geq 0$ and any $\epsilon > 0$
there is a sequence of sets $\{C_n\}$ such that:

(i) $\frac{1}{n} \log M^n(C_n) \leq R(D) + \epsilon$ for all $n \geq 1$

(ii) $\rho_n(X_1^n, C_n) \leq D$ eventually, $\mathbb{P}$-a.s.

Remark 1. Part (ii) of Theorem 2 says that $\mathbb{I}_{[C_n]_D}(X_1^n) \to 1$ with probability one, so by
Fatou's lemma, $P^n([C_n]_D) \to 1$. From this and (i) it is easy to deduce the following alternative
version of Theorem 2 (see the Appendix for a proof): For any $D \geq 0$ there is a sequence of sets
$\{C_n^*\}$ such that:

(i') $\limsup_{n \to \infty} \frac{1}{n} \log M^n(C_n^*) \leq R(D)$

(ii') $P^n([C_n^*]_D) \to 1$, and

(iii') $\limsup_{n \to \infty} E_{P^n}[\rho_n(X_1^n, C_n^*)] \leq D$.

Remark 2. As will become evident from the proof of Theorem 2, the additional assumption
on $\rho$ is only made for the sake of simplicity, and it is not necessary for the validity of the result.
In particular, it allows us to give a unified argument for the cases $D = 0$ and $D > 0$.

Theorem 1 is proved at the end of this section, and Theorem 2 is proved in Section 3.
Although the proof of Theorem 2 is somewhat technical, the idea behind the construction of the
extremal sets $C_n$ is simple: Suppose $Q^*$ is a probability measure on $A$ achieving the infimum in
the definition of $R(D)$, so that

$$R(D) = I(P, Q^*, D) + E_{Q^*}[\log M(Y)] = I^* + L^*.$$

Write $Q_n^*$ for the product measure $(Q^*)^n$, and let $Q_n$ be the measure obtained by conditioning
$Q_n^*$ to the set of points $y_1^n \in A^n$ whose empirical measures ("types") are uniformly close to $Q^*$.
Then let $C_n$ consist of approximately $e^{nI^*}$ points $y_1^n$ drawn IID from $Q_n$. Each point in the
support of $Q_n$ has mass $M^n(y_1^n) \approx e^{nL^*}$ and $C_n$ contains about $e^{nI^*}$ of them, so $M^n(C_n)$ is close
to $e^{nI^*} e^{nL^*} = e^{nR(D)}$. The main technical content of the proof is therefore to prove (ii), namely,
that $e^{nI^*}$ points indeed suffice to almost $D$-cover $A^n$.

The above construction also provides a nice interpretation for $R(D)$. If we had started
with a different measure $Q$ in place of $Q^*$, we would have ended up with sets $C_n'$ of size
$\approx \exp(nI(P, Q, D))$, consisting of points $y_1^n$ of mass $M^n(y_1^n) \approx \exp(nE_Q(\log M(Y)))$, and the
total mass of $C_n'$ would be

$$M^n(C_n') \approx \exp\{n[I(P, Q, D) + E_Q(\log M(Y))]\}.$$

By optimizing over the choice of $Q$ in (9) we are balancing the tradeoff between the size and the
weight of the set $C_n'$, between a few heavy points and many light ones.

It is also worth noting that the extremal sets $C_n$ above were constructed by taking samples
$y_1^n$ from the non-product measure $Q_n$. Unlike in Shannon's proof of the data compression
theorem, here we cannot get away with simply using the product measure $Q_n^*$. This is because
we are not just interested in how many points $y_1^n$ are needed to almost cover $A^n$, but we also
need to control their masses $M^n(y_1^n)$. Since exponentially many $y_1^n$'s are required to cover $A^n$, if
they are generated from $Q_n^*$ then there are bound to be some atypically heavy ones, and this
drastically increases the total mass $M^n(C_n)$. Therefore, by restricting $Q_n^*$ to be supported on
the set of $y_1^n \in A^n$ whose empirical measures are uniformly close to $Q^*$, we are ensuring that
the masses of the $y_1^n$ will be essentially constant, and all approximately equal to $e^{nL^*}$.

Next we derive corollaries from Theorems 1 and 2, along the lines of the examples in the
Introduction. First, in the context of hypothesis testing, let $P_1$, $P_2$ be two probability
distributions on $A$ with all positive probabilities. Suppose that the null hypothesis that $X_1^n \sim P_1^n$ is
to be tested against the alternative $X_1^n \sim P_2^n$. Given a test with an associated decision region
$C_n \subset A^n$, its two probabilities of error $\alpha_n$ and $\beta_n$ are defined as in (7). In the notation of this
section, let $\rho_n$ be Hamming distortion as in (4), $P = P_1$ and $M = P_2$. Observe that, here,

$$E_{P_1^n}[\rho_n(X_1^n, C_n)] \leq P_1^n(C_n^c) = \alpha_n,$$

and define, in the notation of (9), the error exponent

$$e(\alpha) = -R(\alpha; P_1, P_2), \qquad \alpha \in [0, 1].$$

Noting that $e(0) = H(P_1 \| P_2)$, from Theorems 1 and 2 and Remark 1 we obtain the following
version of Stein's lemma (see Lemma 6.1 in Bahadur's monograph [1], or Theorem 12.8.1 in |^).

Corollary 1. (Hypothesis Testing) Let $\alpha = \alpha_n = P_1^n(C_n^c)$ and $\beta = \beta_n = P_2^n(C_n)$ be
the two types of error probabilities associated with an arbitrary sequence of tests $\{C_n\}$.

(i) For all $n \geq 1$, $\beta \geq \exp(-n\, e(\alpha))$.

(ii) If $\alpha_n \to 0$, then

$$\liminf_{n \to \infty} \frac{1}{n} \log \beta_n \geq -H(P_1 \| P_2).$$

(iii) There exists a sequence of decision regions $C_n$ with associated tests whose error probabilities
achieve $\alpha_n \to 0$ and $(1/n) \log \beta_n \to -H(P_1 \| P_2)$, as $n \to \infty$.

Note that, although the decision regions $C_n$ in (iii) above achieve the best exponent in the
error probability, they are not the overall optimal decision regions in the Neyman-Pearson sense.
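The exponent $H(P_1 \| P_2)$ in Corollary 1 is straightforward to compute, and the identity $e(0) = -R(0; P_1, P_2) = H(P_1 \| P_2)$ can be checked directly: under Hamming distortion, $D = 0$ forces $Y = X$, so the objective in the definition of $R(D)$ reduces to the entropy of $P_1$ plus $E_{P_1}[\log P_2(X)]$. The Python sketch below illustrates this; the function names are hypothetical, and distributions are assumed to be dictionaries over the same alphabet with strictly positive $P_2$.

```python
from math import log


def relative_entropy(P1, P2):
    """H(P1 || P2) in nats, for dicts over a common finite alphabet."""
    return sum(p * log(p / P2[a]) for a, p in P1.items() if p > 0)


def rate_at_zero(P1, P2):
    """R(0; P1, P2) under Hamming distortion.

    D = 0 forces Y = X, so I(X;Y) is the entropy of P1 and the mass term
    is the expectation of log P2(X) under P1.
    """
    entropy = -sum(p * log(p) for p in P1.values() if p > 0)
    expected_log_mass = sum(p * log(P2[a]) for a, p in P1.items() if p > 0)
    return entropy + expected_log_mass


P1 = {0: 0.5, 1: 0.5}
P2 = {0: 0.25, 1: 0.75}
print(relative_entropy(P1, P2))   # Stein exponent H(P1 || P2)
print(-rate_at_zero(P1, P2))      # agrees with e(0) = -R(0; P1, P2)
```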

In the case of data compression, we have random data $X_1^n$ generated by some product
distribution $P^n$. Given a single-letter distortion measure $\rho_n$ and a maximum allowable distortion
level $D \geq 0$, our objective is to find good codebooks $C_n$. As discussed in Example 2 above, good
codebooks are those that asymptotically cover $A^n$, i.e., $P^n([C_n]_D) \to 1$, and whose sizes $|C_n|$
are relatively small. In our notation, if we take $M(\cdot) \equiv 1$, then $M^n(C_n) = |C_n|$ and the rate
function $R(D)$ (from (9) or (10)) reduces to Shannon's rate-distortion function

$$R_S(D) = \inf_{Q} \inf_{W \in \mathcal{M}(P, Q, D)} H(W \| P \times Q) = \inf_{(X, Y) :\, X \sim P,\ E\rho(X, Y) \leq D} I(X; Y).$$

From Theorems 1 and 2 and Remark 1 we recover Shannon's source coding theorem (see [15|[g]).

Corollary 2. (Data Compression) For any $n \geq 1$, if the average distortion achieved by
a codebook $C_n$ is $D = E_{P^n}[\rho_n(X_1^n, C_n)]$, then

$$\frac{1}{n} \log |C_n| \geq R_S(D).$$

Moreover, for any $D \geq 0$, there is a sequence of codebooks $\{C_n\}$ such that $E_{P^n}[\rho_n(X_1^n, C_n)] \to D$,
the codebooks $C_n$ asymptotically cover $A^n$, $P^n([C_n]_D) \to 1$, and

$$\lim_{n \to \infty} \frac{1}{n} \log |C_n| = R_S(D).$$
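The finite-$n$ lower bound in Corollary 2 can be checked by brute force in the simplest case. The sketch below assumes a Bernoulli($p$) source with Hamming distortion, for which the Shannon rate-distortion function has the well-known closed form $R_S(D) = h(p) - h(D)$ for $0 \leq D < \min(p, 1-p)$ and $R_S(D) = 0$ otherwise, where $h$ is the binary entropy in nats. That closed form is a standard textbook fact rather than something derived in this paper, and the function names are illustrative.

```python
import random
from itertools import product
from math import log, prod


def h(q):
    """Binary entropy in nats."""
    return 0.0 if q in (0.0, 1.0) else -q * log(q) - (1 - q) * log(1 - q)


def shannon_rate_bernoulli(p, D):
    """R_S(D) in nats for a Bernoulli(p) source under Hamming distortion."""
    if D >= min(p, 1 - p):
        return 0.0
    return h(p) - h(D)


def average_distortion(C, p, n):
    """E[rho_n(X_1^n, C)] for X_1^n IID Bernoulli(p), by full enumeration."""
    P = (1 - p, p)
    total = 0.0
    for x in product((0, 1), repeat=n):
        nearest = min(sum(a != b for a, b in zip(x, y)) for y in C) / n
        total += nearest * prod(P[a] for a in x)
    return total


n, p = 6, 0.4
rng = random.Random(0)
points = list(product((0, 1), repeat=n))
for size in (1, 2, 4, 8):
    C = rng.sample(points, size)          # an arbitrary (random) codebook
    D = average_distortion(C, p, n)
    print(size, D, log(size) / n, shannon_rate_bernoulli(p, D))
```

In each printed row the third column (the rate of the codebook) should be at least the fourth (the rate-distortion function at the achieved distortion), as the corollary asserts for every $n$.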

Finally, in the context of measure-concentration, taking $M = P$ and writing $R_C(D)$ for the
concentration exponent $R(D; P, P)$, we get:

Corollary 3. (Converse Measure Concentration) Let $\{C_n\}$ be arbitrary sets.

(i) For any $n \geq 1$, if $D = E_{P^n}[\rho_n(X_1^n, C_n)]$, then $P^n(C_n) \geq e^{nR_C(D)}$.

(ii) If $P^n([C_n]_D) \to 1$, then

$$\liminf_{n \to \infty} \frac{1}{n} \log P^n(C_n) \geq R_C(D).$$

(iii) There is a sequence of sets $\{C_n\}$ such that $P^n([C_n]_D) \to 1$ and $(1/n) \log P^n(C_n) \to R_C(D)$,
as $n \to \infty$.

In particular, in the case of the binary cube, part (ii) of the corollary provides a precise
converse to the measure-concentration statement in (6). Although the concentration exponent
$R_C(D) = R(D; P, P)$ is not as explicit as the exponent $-D^2/2$ in (6), $R_C(D)$ is a well-behaved
function and it is easy to evaluate it numerically. For example, Figure 1 shows the graph of
$R_C(D)$ in the case of the binary cube, with $P$ being the Bernoulli measure with $P(1) = 0.4$.
Various easily checked properties of $R(D) = R(D; P, M)$ are stated in Lemma 1, below; proof
outlines are given in the Appendix.
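One of the properties stated in Lemma 1 (part (v)) is that $R(D)$ is constant, equal to $R_{\min} = \min_y \log M(y)$, for all $D$ at or above a threshold $D_{\max}$. Both quantities are simple to compute. The following Python sketch does so for a finite alphabet; the function name `r_min_and_d_max` and the dictionary representation of $P$ and $M$ are assumptions made for illustration.

```python
from math import log


def r_min_and_d_max(P, M, rho):
    """Return (R_min, D_max) as defined in Lemma 1 (v).

    R_min is the smallest value of log M(y); D_max is the smallest expected
    distortion E_P[rho(X, y)] over the letters y that attain R_min.
    """
    r_min = min(log(M[y]) for y in M)
    lightest = [y for y in M if abs(log(M[y]) - r_min) < 1e-12]
    d_max = min(sum(P[x] * rho(x, y) for x in P) for y in lightest)
    return r_min, d_max


def hamming(a, b):
    return 0.0 if a == b else 1.0


P = {0: 0.6, 1: 0.4}
# Concentration exponent R_C(D) = R(D; P, P): the flat part of the curve.
print(r_min_and_d_max(P, P, hamming))
# Shannon case M = 1: R_min = 0 and D_max = min(P(0), P(1)).
print(r_min_and_d_max(P, {0: 1.0, 1: 1.0}, hamming))
```

For the binary cube with $P(1) = 0.4$ and $M = P$, the lightest letter is $1$, so the sketch gives $R_{\min} = \log 0.4$ and $D_{\max} = P(0) = 0.6$.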

As mentioned in the Introduction, the question considered in Corollary 3 can be thought of
as the opposite of the usual isoperimetric problem. Instead of large sets with small boundaries,
we are looking for small sets with the largest possible boundary. It is therefore not surprising
that the extremal sets in (5) and in Corollary 3 are very different. In the classical isoperimetric
problem, the extremal sets typically look like Hamming balls around $0^n = (0, 0, \ldots, 0) \in A^n$,
$B_n = \{x_1^n : \rho_n(x_1^n, 0^n) \leq r/n\}$ (see the discussions in Section 2.3 of [17], p. 174 in |jll|, or the
original paper by Harper [10]), while the extremal sets in our case are collections of vectors $y_1^n$
drawn IID from the measure $Q_n$ on $A^n$.

Figure 1: Graph of the function $R_C(D) = R(D; P, P)$ for $0 \leq D \leq 1$, in the case of the binary
cube $A^n = \{0, 1\}^n$, with $P(1) = 0.4$.



Lemma 1. (i) The infima in the definitions of $R(D)$ and $I(P, Q, D)$ in (9) and (8) are in fact
minima.

(ii) $R(D)$ is finite for all $D \geq 0$, it is nonincreasing and convex in $D$, and therefore also
continuous.

(iii) For fixed $P$ and $Q$, $I(P, Q, D)$ is nonincreasing and convex in $D$, and therefore it is
continuous except possibly at the point $D = \inf\{D \geq 0 : I(P, Q, D) < \infty\}$.

(iv) If the random variables $X_1^n = (X_1, \ldots, X_n)$ are IID, then for any random vector $Y_1^n$
jointly distributed with $X_1^n$:

$$I(X_1^n; Y_1^n) \geq \sum_{i=1}^{n} I(X_i; Y_i).$$

(v) If we let $R_{\min} = \min\{\log M(y) : y \in A\}$ and

$$D_{\max} = D_{\max}(P) = \min\{E_P[\rho(X, y)] : y \text{ such that } \log M(y) = R_{\min}\},$$

then

$$R(D) \text{ is } \begin{cases} = R_{\min}, & \text{for } D \geq D_{\max} \\ > R_{\min}, & \text{for } 0 \leq D < D_{\max}. \end{cases}$$



Next we prove Theorem 1. It is perhaps somewhat surprising that the proof is very short 
and completely elementary, relying only on Jensen's inequality and the convexity of R{D). 



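Proof of Theorem 1: Given an arbitrary $C_n$, let $\phi_n : A^n \to C_n$ be a function that maps
each $x_1^n \in A^n$ to the closest $y_1^n$ in $C_n$, i.e., $\rho_n(x_1^n, \phi_n(x_1^n)) = \rho_n(x_1^n, C_n)$. For $X_1^n \sim P^n$ let
$Y_1^n = \phi_n(X_1^n)$, write $Q_n$ for the distribution of $Y_1^n$, and $W_n(x_1^n, y_1^n) = P^n(x_1^n) \mathbb{I}_{\{\phi_n(x_1^n)\}}(y_1^n)$ for
the joint distribution of $(X_1^n, Y_1^n)$. Then

$$E_{W_n}[\rho_n(X_1^n, Y_1^n)] = D \tag{11}$$

and by Jensen's inequality,

$$\log M^n(C_n) \geq \log \Big[ \sum_{y_1^n :\, Q_n(y_1^n) > 0} Q_n(y_1^n) \frac{M^n(y_1^n)}{Q_n(y_1^n)} \Big] \geq \sum_{y_1^n :\, Q_n(y_1^n) > 0} Q_n(y_1^n) \log \frac{M^n(y_1^n)}{Q_n(y_1^n)} = \sum_{x_1^n, y_1^n} W_n(x_1^n, y_1^n) \log \frac{W_n(x_1^n, y_1^n)}{P^n(x_1^n) Q_n(y_1^n)} + E_{Q_n}[\log M^n(Y_1^n)],$$

where the first inequality holds because $M$ is positive and $Q_n$ is supported on $C_n$, and the last
equality holds because $Y_1^n$ is a function of $X_1^n$. By the definition of mutual information this equals

$$I(X_1^n; Y_1^n) + E_{Q_n}[\log M^n(Y_1^n)],$$

which, by Lemma 1 (iv), is bounded below by

$$\sum_{i=1}^{n} \big[ I(X_i; Y_i) + E_{Q_n}[\log M(Y_i)] \big].$$

Finally, by the definition of $R(D)$ and its convexity this is bounded below by

$$\sum_{i=1}^{n} R\big(E_{W_n}[\rho(X_i, Y_i)]\big) \geq n R\Big( \frac{1}{n} \sum_{i=1}^{n} E_{W_n}[\rho(X_i, Y_i)] \Big) = n R(D)$$

where the last equality follows from (11). □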

3 Proof of Theorem 2. 

Let $P$, $D \geq 0$ be fixed, and $\epsilon > 0$ be given. By Lemma 1 (i) we can pick $Q^*$ and $W^*$ in the
definition of $R(D)$ and $I(P, Q^*, D)$, respectively, such that

$$R(D) = H(W^* \| P \times Q^*) + E_{Q^*}[\log M(Y)] = I^* + L^*.$$

For $n \geq 1$, write $Q_n^*$ for the product measure $(Q^*)^n$, and for $y_1^n \in A^n$ let

$$\hat{P}_{y_1^n} = \frac{1}{n} \sum_{i=1}^{n} \delta_{y_i}$$

denote the empirical measure of $y_1^n$. Pick $\delta > 0$ (to be chosen later) and define, for each $n \geq 1$,
the set of "good" strings

$$\mathcal{G}_n = \{y_1^n \in A^n : \hat{P}_{y_1^n}(b) \leq Q^*(b) + \delta,\ \forall b \in A\}$$

(if $\mathcal{G}_n$ as defined above is empty - this may only happen for finitely many $n$ - simply let $\mathcal{G}_n$
consist of a single vector $(a, a, \ldots, a)$, with $a \in A$ chosen so that $\log M(a) = R_{\min}$). Also, let
$Q_n$ be the measure $Q_n^*$ conditioned on $\mathcal{G}_n$.
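The set of "good" strings and the conditioned measure used in this construction can be illustrated with a short Python sketch. Sampling from the product measure and rejecting strings outside the good set produces samples from the conditioned measure. The function names and the `max_tries` cutoff are assumptions made here for illustration, and the fallback for an empty good set described above is not implemented.

```python
import random
from collections import Counter


def empirical_measure(y, alphabet):
    """Empirical measure (type) of the string y over the given alphabet."""
    counts = Counter(y)
    return {b: counts[b] / len(y) for b in alphabet}


def is_good(y, Q_star, delta):
    """True if the type of y is at most Q*(b) + delta for every letter b."""
    emp = empirical_measure(y, Q_star)
    return all(emp[b] <= Q_star[b] + delta for b in Q_star)


def sample_conditioned(Q_star, n, delta, rng, max_tries=100_000):
    """Draw one string from the product measure (Q*)^n conditioned on the good set.

    Uses rejection sampling: draw IID letters from Q*, keep the first good string.
    """
    letters = list(Q_star)
    weights = [Q_star[b] for b in letters]
    for _ in range(max_tries):
        y = tuple(rng.choices(letters, weights=weights, k=n))
        if is_good(y, Q_star, delta):
            return y
    raise RuntimeError("no good string found; the good set may be empty or very rare")


rng = random.Random(1)
Q_star = {0: 0.7, 1: 0.3}
y = sample_conditioned(Q_star, 50, 0.05, rng)
print(empirical_measure(y, Q_star))
```

Because every accepted string has a type close to $Q^*$, the per-letter log-mass of each sample stays close to $E_{Q^*}[\log M(Y)]$, which is the point of conditioning.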

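For $n \geq 1$, let $\{Y(i) = (Y_1(i), Y_2(i), \ldots, Y_n(i)) ; i \geq 1\}$ be an IID sequence of random vectors
$Y(i) \sim Q_n$, and define $C_n$ as the collection of the first $e^{n(I^* + \epsilon/2)}$ of them:

$$C_n = \{Y(i) : 1 \leq i \leq e^{n(I^* + \epsilon/2)}\}.$$

By the definition of $\mathcal{G}_n$, any $y_1^n \in \mathcal{G}_n$ has

$$\frac{1}{n} \log M^n(y_1^n) = \sum_{b \in A} \hat{P}_{y_1^n}(b) \log M(b) \leq L^* + \delta \Big( \sum_{b \in A} \log M(b) \Big) \leq L^* + \epsilon/2,$$

by choosing $\delta > 0$ appropriately small. Therefore,

$$\frac{1}{n} \log M^n(C_n) \leq \frac{1}{n} \log \big[ e^{n(I^* + \epsilon/2)} e^{n(L^* + \epsilon/2)} \big] = R(D) + \epsilon,$$

and (i) of the Theorem is satisfied. Let $X_1^\infty$ be IID random variables with distribution $P$. To
verify (ii) we will show that

$$i_n \leq e^{n(I^* + \epsilon/2)} \quad \text{eventually, } \mathbb{P} \times \mathbb{Q}\text{-a.s.} \tag{12}$$

where $i_n$ is the index of the first $Y(i)$ that matches $X_1^n$ within $\rho_n$-distortion $D$,

$$i_n = \inf\{i \geq 1 : \rho_n(X_1^n, Y(i)) \leq D\},$$

and $\mathbb{Q} = \prod_{n \geq 1} (Q_n)^{\mathbb{N}}$. Recall the notation $B(x_1^n, D) = \{y_1^n \in A^n : \rho_n(x_1^n, y_1^n) \leq D\}$. For (12) it
suffices to prove the following two statements

$$\limsup_{n \to \infty} \frac{1}{n} \log \big[ i_n Q_n(B(X_1^n, D)) \big] \leq 0 \quad \mathbb{P} \times \mathbb{Q}\text{-a.s.} \tag{13}$$

$$\liminf_{n \to \infty} \frac{1}{n} \log Q_n(B(X_1^n, D)) \geq -I^* \quad \mathbb{P}\text{-a.s.} \tag{14}$$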

Proving (14) is the main technical part of the proof and it will be done last. Assuming it holds,
we will first establish (13). For $m \geq 1$ let $G_m = \{x_1^\infty \in A^\infty : Q_n(B(x_1^n, D)) > 0\ \forall n \geq m\}$, and
note that by (14), $\mathbb{P}(\cup_{m \geq 1} G_m) = 1$. Pick $m \geq 1$; for any $n \geq m$, and any $x_1^\infty \in G_m$, conditional
on $X_1^n = x_1^n$, $i_n$ is a Geometric($p_n$) random variable with $p_n = Q_n(B(x_1^n, D))$. So for $\epsilon' > 0$
arbitrary

$$\Pr\Big\{ \frac{1}{n} \log \big[ i_n Q_n(B(X_1^n, D)) \big] > \epsilon' \,\Big|\, X_1^n = x_1^n \Big\} \leq (1 - p_n)^{(e^{n\epsilon'}/p_n) - 1}$$

and for all $n$ large enough (independent of $x_1^n$) this is bounded above by

$$(1 - p_n)^{e^{n\epsilon'}/(2p_n)} \leq e^{-e^{n\epsilon'}/2}$$

uniformly over $x_1^\infty \in G_m$. Since the above right-hand side is summable over $n$, by the
Borel-Cantelli lemma and the fact that $\epsilon' > 0$ was arbitrary we get (13) for $\mathbb{P}$-almost all $x_1^\infty \in G_m$.
But since $\mathbb{P}(\cup_{m \geq 1} G_m) = 1$, this proves (13).

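Next we turn to the proof of (14). Since, by the law of large numbers, $Q_n^*(\mathcal{G}_n) \to 1$, as
$n \to \infty$, (14) is equivalent to

$$\liminf_{n \to \infty} \frac{1}{n} \log Q_n^*\big(B(X_1^n, D) \cap \mathcal{G}_n\big) \geq -I^* \quad \mathbb{P}\text{-a.s.} \tag{15}$$

Choose and fix one of the (almost all) realizations $x_1^\infty$ of $\mathbb{P}$ for which

$$\hat{P}_{x_1^n}(a) \to P(a), \quad \text{for all } a \in A.$$

Let $\epsilon_1 \in (0, \delta)$ be arbitrary, and choose and fix $N$ large enough so that

$$|\hat{P}_{x_1^n}(a) - P(a)| < \epsilon_1 P(a) \quad \text{for all } a \in A,\ n \geq N. \tag{16}$$

Let $a_1, a_2, \ldots, a_m$ denote the elements of $A$, write $n_0 = 0$,

$$n_i = n \hat{P}_{x_1^n}(a_i), \quad i = 1, 2, \ldots, m$$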

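and $N_j = \sum_{k=0}^{j} n_k$, $j = 0, 1, \ldots, m$. For $n \geq N$, writing $Y_1^n = (Y_1, Y_2, \ldots, Y_n)$ for a vector of
random variables with distribution $Q_n^*$, we have that $Q_n^*(B(x_1^n, D) \cap \mathcal{G}_n)$ equals

$$\Pr\Big\{ \frac{1}{n} \sum_{i=1}^{n} \rho(x_i, Y_i) \leq D \ \text{ and } \ \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{Y_i = b\}} \leq Q^*(b) + \delta,\ \forall b \in A \Big\}$$

$$= \Pr\Big\{ \frac{1}{n} \sum_{i=1}^{m} \sum_{j=N_{i-1}+1}^{N_i} \rho(a_i, Y_j) \leq D \ \text{ and } \ \frac{1}{n} \sum_{i=1}^{m} \sum_{j=N_{i-1}+1}^{N_i} \mathbb{I}_{\{Y_j = b\}} \leq Q^*(b) + \delta,\ \forall b \in A \Big\}$$

where we have used the fact that the $Y_i$ are IID (and hence exchangeable) to rewrite $x_1^n$ as
consisting of $n_1$ $a_1$'s followed by $n_2$ $a_2$'s, and so on. Let $\gamma_i = P(a_i) \sum_{b \in A} W^*(b|a_i) \rho(a_i, b)$ for
$i = 1, 2, \ldots, m$. Recalling that, by the choice of $W^*$, $\sum_i \gamma_i = E_{W^*} \rho(X, Y) \leq D$, and that $Q^*$ is
the $Y$-marginal of $W^*$, the above probability is bounded below by

$$\prod_{i=1}^{m} Q_{n_i}^*\Big\{ \frac{1}{n} \sum_{j=1}^{n_i} \rho(a_i, Y_j) \leq \gamma_i \ \text{ and } \ \frac{1}{n} \sum_{j=1}^{n_i} \mathbb{I}_{\{Y_j = b\}} \leq P(a_i)\big[W^*(b|a_i) + \delta\big],\ \forall b \in A \Big\}.$$

Writing $\Gamma_i = \gamma_i / [P(a_i)(1 + \epsilon_1)]$, $i = 1, 2, \ldots, m$ and using (16), this is in turn bounded below by

$$\prod_{i=1}^{m} Q_{n_i}^*\Big\{ \frac{1}{n_i} \sum_{j=1}^{n_i} \rho(a_i, Y_j) \leq \Gamma_i \ \text{ and } \ \frac{1}{n_i} \sum_{j=1}^{n_i} \mathbb{I}_{\{Y_j = b\}} \leq \frac{W^*(b|a_i) + \delta}{1 + \epsilon_1},\ \forall b \in A \Big\} = \prod_{i=1}^{m} Q_{n_i}^*\big\{ \hat{P}_{Y_1^{n_i}} \in F_i \big\} \tag{17}$$

where $F_i$ is the collection of probability mass functions $Q$ on $A$,

$$F_i = F_i(\epsilon_1) = \Big\{ Q : E_Q[\rho(a_i, Y)] \leq \Gamma_i \ \text{ and } \ Q(b) \leq \frac{W^*(b|a_i) + \delta}{1 + \epsilon_1},\ \forall b \in A \Big\}.$$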

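We will apply Sanov's theorem to each one of the terms in (17). Consider two cases: If $\Gamma_i > 0$
then $F_i$ is the closure of its interior (in the Euclidean topology), so by Sanov's theorem

$$\liminf_{n_i \to \infty} \frac{1}{n_i} \log Q_{n_i}^*\big\{ \hat{P}_{Y_1^{n_i}} \in F_i \big\} \geq - \inf_{Q \in F_i} H(Q \| Q^*) \tag{18}$$

(see Theorem 2.1.10 and Exercise 2.1.16 in [8]). If $\Gamma_i = 0$ then $\gamma_i = 0$ and this can only happen
if $W^*(\cdot|a_i) = \mathbb{I}_{\{a_i\}}(\cdot)$, in which case $F_i = \{\delta_{a_i}\}$ and

$$\frac{1}{n_i} \log Q_{n_i}^*\big\{ \hat{P}_{Y_1^{n_i}} \in F_i \big\} = \log Q^*(a_i) = -H(\delta_{a_i} \| Q^*)$$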

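so (18) still holds in this case. Combining the above steps (note that each $n_i \to \infty$ as $n \to \infty$),

$$\liminf_{n \to \infty} \frac{1}{n} \log Q_n^*\big(B(x_1^n, D) \cap \mathcal{G}_n\big) \geq \liminf_{n \to \infty} \frac{1}{n} \log \Big[ \prod_{i=1}^{m} Q_{n_i}^*\big\{ \hat{P}_{Y_1^{n_i}} \in F_i \big\} \Big]$$

$$= \liminf_{n \to \infty} \sum_{i=1}^{m} \hat{P}_{x_1^n}(a_i) \frac{1}{n_i} \log Q_{n_i}^*\big\{ \hat{P}_{Y_1^{n_i}} \in F_i \big\} \geq - \sum_{i=1}^{m} P(a_i) \inf_{Q \in F_i} H(Q \| Q^*),$$

and this holds for $\mathbb{P}$-almost any $x_1^\infty$. Rewriting the $i$th infimum above as the infimum over
conditional measures $W(\cdot|a_i) \in F_i$, yields

$$\liminf_{n \to \infty} \frac{1}{n} \log Q_n^*\big(B(X_1^n, D) \cap \mathcal{G}_n\big) \geq - \inf_{W \in F(\epsilon_1)} H(W \| P \times Q^*) \quad \mathbb{P}\text{-a.s.}$$

where $F(\epsilon_1) = \{W : W_X = P \text{ and } W(\cdot|a_i) \in F_i(\epsilon_1),\ \forall i = 1, 2, \ldots, m\}$. Finally, since $\epsilon_1$ was
arbitrary we can let it decrease to 0 to obtain

$$\liminf_{n \to \infty} \frac{1}{n} \log Q_n^*\big(B(X_1^n, D) \cap \mathcal{G}_n\big) \geq \limsup_{\epsilon_1 \downarrow 0} \Big[ - \inf_{W \in F(\epsilon_1)} H(W \| P \times Q^*) \Big] \overset{(a)}{=} - \inf_{W \in F(0)} H(W \| P \times Q^*) \overset{(b)}{\geq} -I^* \quad \mathbb{P}\text{-a.s.}$$

This gives (15) and completes the proof, once we justify steps (a) and (b). Step (b) follows upon
noticing that $W^* \in F(0)$ and recalling that $H(W^* \| P \times Q^*) = I^*$. Step (a) follows from the fact
that $H(W \| P \times Q^*)$ is continuous over those $W$ that are absolutely continuous with respect to
$P \times Q^*$, and from the observation in Lemma 2 below (verified in the Appendix). □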

Lemma 2. For all $\epsilon_1 > 0$ small enough there exist $Q_i \in F_i(\epsilon_1)$ such that $H(Q_i \| Q^*) < \infty$, for
$1 \leq i \leq m$. Therefore, for all $\epsilon_1 > 0$ small enough there exists $W \in F(\epsilon_1)$ with $H(W \| P \times Q^*) < \infty$.

Note that, in the above proof, a somewhat stronger result than the one given in Theorem 2
is established: It is not just demonstrated that there exist sets $C_n$ achieving (i) and (ii), but
that (almost) any sequence of sets $C_n$ generated by taking approximately $e^{nI^*}$ IID samples from
$Q_n$ will satisfy (i) and (ii).

We also mention that Bucklew [Q] used Sanov's theorem to prove the direct part of Shannon's
data compression theorem. The proof of Theorem 2 is similar, except that it involves a less
direct application of Sanov's theorem to the sequence of non-product measures $Q_n$, and the
conclusions obtained are somewhat stronger (pointwise rather than $L^1$ bounds). Similarly, in
the proof of Theorem 4, the Gartner-Ellis theorem from large deviations is applied in a manner
which parallels the approach of [||.

4 The General Case 

Let $A$ be a Polish space (namely, a complete, separable metric space) equipped with its associated
Borel $\sigma$-algebra $\mathcal{A}$, and let $\mathbb{P}$ be a probability measure on $(A^{\mathbb{N}}, \mathcal{A}^{\mathbb{N}})$. Also let $(\hat{A}, \hat{\mathcal{A}})$ be a (possibly
different) Polish space. Given a nonnegative measurable function $\rho : A \times \hat{A} \to [0, \infty)$, define
$\rho_n : A^n \times \hat{A}^n \to [0, \infty)$ as in (1). [The reason for considering $\hat{A}$ as possibly different from $A$
is motivated by the common data compression scenario, where, in practice, it is often the case
that original data take values in a large alphabet $A$ (for example, Gaussian data have $A = \mathbb{R}$),
whereas compressed data take values in a much smaller alphabet (for example, Gaussian data
on a computer are typically quantized to the finite alphabet $\hat{A}$ consisting of all double precision
reals).]

Let $\{X_n\}$ be a sequence of random variables distributed according to $\mathbb{P}$, and for each $n \geq 1$
write $P_n$ for the $n$-dimensional marginal distribution of $X_1^n$. We say that $\mathbb{P}$ is a stationary
measure if $X_1^n$ has the same distribution as $X_{k+1}^{k+n}$, for any $n$, $k$. Let $M : \hat{A} \to (0, \infty)$ be a measurable
"mass" function on $\hat{A}$. To avoid uninteresting technicalities, we will assume throughout that
$M$ is bounded away from zero, $M(y) \geq M_*$ for some constant $M_* > 0$ and all $y \in \hat{A}$. Next we
define the natural analogs of the rate functions $I(P, Q, D)$ and $R(D)$. For $n \geq 1$, $D \geq 0$ and $Q_n$
a probability measure on $(\hat{A}^n, \hat{\mathcal{A}}^n)$, let

$$I_n(P_n, Q_n, D) = \inf_{W_n \in \mathcal{M}_n(P_n, Q_n, D)} H(W_n \| P_n \times Q_n) \tag{19}$$

where $H(\mu \| \nu)$ denotes the relative entropy between two probability measures $\mu$ and $\nu$

$$H(\mu \| \nu) = \begin{cases} \int d\mu \, \log \frac{d\mu}{d\nu}, & \text{when } \frac{d\mu}{d\nu} \text{ exists} \\ \infty, & \text{otherwise} \end{cases}$$

and where $\mathcal{M}_n(P_n, Q_n, D)$ consists of all probability measures $W_n$ on $(A^n \times \hat{A}^n, \mathcal{A}^n \times \hat{\mathcal{A}}^n)$ such
that $W_{n,X}$, the first marginal of $W_n$, is equal to $P_n$, the second marginal $W_{n,Y}$ is $Q_n$, and
$\int \rho_n \, dW_n \leq D$; if $\mathcal{M}_n(P_n, Q_n, D)$ is empty, let $I_n(P_n, Q_n, D) = \infty$. Then $R_n(D)$ is defined by

$$R_n(D) = R_n(D; P_n, M) = \inf_{Q_n} \{ I_n(P_n, Q_n, D) + E_{Q_n}[\log M^n(Y_1^n)] \}, \tag{20}$$

where the infimum is over all probability measures $Q_n$ on $(\hat{A}^n, \hat{\mathcal{A}}^n)$. Note that since $I_n(P_n, Q_n, D)$
is nonnegative and $M$ is bounded away from zero, $R_n(D)$ is always well-defined. Recall also that
the mutual information between two random vectors $X_1^n$ and $Y_1^n$ with joint distribution $W_n$ and
corresponding marginals $P_n$ and $Q_n$, is defined by $I(X_1^n; Y_1^n) = H(W_n \| P_n \times Q_n)$, so that $R_n(D)$
can alternatively be written in a form analogous to (10) in the discrete case:

$$R_n(D) = \inf_{(X_1^n, Y_1^n) :\, X_1^n \sim P_n,\ E\rho_n(X_1^n, Y_1^n) \leq D} \{ I(X_1^n; Y_1^n) + E[\log M^n(Y_1^n)] \}.$$

Finally, the rate function $R(D)$ is defined by

$$R(D) = \lim_{n \to \infty} \frac{1}{n} R_n(D)$$

whenever the limit exists. Next we state some simple properties of $R_n(D)$ and $R(D)$, proved in
the Appendix.

Lemma 3. (i) For each $n \geq 1$, $R_n(D)$ is nonincreasing and convex in $D \geq 0$, and therefore
also continuous at all $D$ except possibly at the point

$$D_{\min}^{(n)} = \inf\{D \geq 0 : R_n(D) < +\infty\}.$$

(ii) If $R(D)$ exists then it is nonincreasing and convex in $D \geq 0$, and therefore also
continuous at all $D$ except possibly at the point

$$D_{\min} = \inf\{D \geq 0 : R(D) < +\infty\}.$$

(iii) If $\mathbb{P}$ is a stationary measure, then

$$R(D) = \lim_{n \to \infty} \frac{1}{n} R_n(D) = \inf_{n \geq 1} \frac{1}{n} R_n(D) \quad \text{exists},$$

and $D_{\min} = \inf_n D_{\min}^{(n)}$.

(iv) The mutual information $I(X_1^n; Y_1^n)$ is convex in the marginal distribution $P_n$ of $X_1^n$, for
a fixed conditional distribution of $Y_1^n$ given $X_1^n$.

Next we state analogs of Theorems 1 and 2 in the general case. As before, we are interested
in sets $C_n$ that have large blowups but small masses; since $M$ is bounded away from zero we
may restrict our attention to finite sets $C_n$.

Theorem 3. Let $C_n \subset \hat{A}^n$ be an arbitrary finite set and write $D = E_{P_n}[\rho_n(X_1^n, C_n)]$. Then

$$\log M^n(C_n) \geq R_n(D). \tag{21}$$

If $\mathbb{P}$ is a stationary measure, then for all $n \geq 1$

$$\log M^n(C_n) \geq n R(D).$$

As will become apparent from its proof (at the end of this section), Theorem 3 remains true in great generality. The exact same proof works for arbitrary (non-product) positive mass functions $M_n$ in place of $M^n$, and more general distortion measures $\rho_n$, not necessarily of the form in (1). Moreover, as long as $R_n(D)$ is well-defined, the assumption that $M$ is bounded away from zero is unnecessary. In that case we can also consider countably infinite sets $C_n$, and (21) remains valid as long as $R_n(D)$ is continuous in $D$ (see Lemma 3).

In the special case when P is a product measure it is not hard to check that Rn{D) = nR{D) 
for all n > 1, so we can recover Theorem 1 from Theorem 3. 

For Theorem 4 some additional assumptions are needed. We will assume that the functions $\rho$ and $\log M$ are bounded, i.e., that there exist constants $\rho_{\max} < \infty$ and $L_{\max} < \infty$ such that $\rho(x,y) \le \rho_{\max}$ and $|\log M(y)| \le L_{\max}$, for all $x \in A$, $y \in \hat{A}$. For $k \ge 1$, we say that $\mathbb{P}$ is stationary (respectively, ergodic) in $k$-blocks if the process $\{\tilde{X}_n^{(k)};\ n \ge 0\} = \{X_{nk+1}^{(n+1)k};\ n \ge 0\}$ is stationary (resp. ergodic). If $\mathbb{P}$ is stationary then it is stationary in $k$-blocks for every $k$. But an ergodic measure $\mathbb{P}$ may not be ergodic in $k$-blocks. For the second part of the Theorem we will assume that $\mathbb{P}$ is ergodic in blocks, that is, that it is ergodic in $k$-blocks for all $k \ge 1$. Also, since $R(D) = \infty$ for $D$ below $D_{\min}$, we restrict our attention to the case $D > D_{\min}$. Theorem 4 is proved in the next section.
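To make concrete the remark that an ergodic measure need not be ergodic in $k$-blocks, the sketch below simulates the period-2 chain that alternates 0 and 1 with a uniformly random starting phase. This example and the helper names `alternating_path` and `block_frequency` are illustrative assumptions, not taken from the text. The symbol frequencies do not depend on the phase, while the frequencies of non-overlapping 2-blocks do, so the 2-blocked process is not ergodic.

```python
import random

def alternating_path(n, rng):
    """Sample path of the stationary ergodic chain 0101... / 1010... with uniform phase."""
    phase = rng.randrange(2)
    return [(phase + i) % 2 for i in range(n)]

def block_frequency(path, k, block):
    """Fraction of non-overlapping k-blocks of the path that equal `block`."""
    blocks = [tuple(path[i:i + k]) for i in range(0, len(path) - k + 1, k)]
    return sum(b == tuple(block) for b in blocks) / len(blocks)

if __name__ == "__main__":
    rng = random.Random(0)
    path = alternating_path(1000, rng)
    # Symbol frequency is 1/2 for either phase ...
    print(sum(path) / len(path))
    # ... but the 2-block (0, 1) has frequency 1 or 0 depending on the phase.
    print(block_frequency(path, 2, (0, 1)))
```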
Theorem 4. Assume that the functions $\rho$ and $\log M$ are bounded, and that $\mathbb{P}$ is a stationary ergodic measure. For any $D > D_{\min}$ and any $\epsilon > 0$, there is a sequence of sets $\{C_n\}$ such that:

(i) $\frac{1}{n}\log M^n(C_n) \le R(D) + \epsilon$ for all $n \ge 1$

(ii) $P_n([C_n]_D) \to 1$ as $n \to \infty$.

If, moreover, $\mathbb{P}$ is ergodic in blocks, there are sets $\{C_n\}$ that satisfy (i) and

(iii) $\rho_n(X_1^n, C_n) \le D$ eventually, $\mathbb{P}$-a.s.

Remark 3. A corresponding version of the asymptotic form of Theorems 1 and 2 given in Remark 1 of the previous section can also be derived here, and it holds for every stationary ergodic $\mathbb{P}$.

Remark 4. The assumptions on the boundedness of $\rho$ and $\log M$ are made for the purpose of technical convenience, and can probably be relaxed to appropriate moment conditions. Similarly, the assumption that $M^n$ is a product measure can be relaxed to include sequences of measures $M_n$ that have rapid mixing properties. Finally, the assumption that $\mathbb{P}$ is ergodic in blocks is not as severe as it may sound. For example, it is easy to see that any weakly mixing measure (in the ergodic-theoretic sense; see [13]) is ergodic in blocks.



Proof of Theorem 3: Given an arbitrary $C_n$, let $\phi_n : A^n \to C_n$ be defined as in the proof of Theorem 1. For $X_1^n \sim P_n$ define $Y_1^n = \phi_n(X_1^n)$, write $Q_n$ for the (discrete) distribution of $Y_1^n$, and $W_n(dx_1^n, dy_1^n) = P_n(dx_1^n)\,\delta_{\phi_n(x_1^n)}(dy_1^n)$ for the joint distribution of $(X_1^n, Y_1^n)$. Then $E_{W_n}[\rho_n(X_1^n, Y_1^n)] = D$, and by Jensen's inequality applied as in the discrete case

$$\begin{aligned}
\log M^n(C_n) &\ge \sum_{y_1^n \in C_n} Q_n(y_1^n)\log\frac{M^n(y_1^n)}{Q_n(y_1^n)} \\
&= \int dW_n(x_1^n,y_1^n)\,\log\frac{dW_n}{d(P_n\times Q_n)}(x_1^n,y_1^n) + \sum_{y_1^n\in C_n} Q_n(y_1^n)\log M^n(y_1^n) \\
&= I(X_1^n;Y_1^n) + E_{Q_n}[\log M^n(Y_1^n)].
\end{aligned}$$

By the definition of $R_n(D)$, this is bounded below by $R_n(D)$. The second part follows immediately from the fact that $R_n(D) \ge nR(D)$, by Lemma 3 (iii). □
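The Jensen step in the proof above can be checked numerically on a small finite alphabet. The sketch below is an illustration under assumptions (the function name `check_converse_step` and the nearest-neighbour choice of $\phi$ are hypothetical stand-ins for the quantizer of Theorem 1): it returns $\log M(C)$ together with the lower bound $\sum_y Q(y)\log[M(y)/Q(y)]$, where $Q$ is the distribution of $\phi(X)$.

```python
import math

def check_converse_step(P, M, rho, codebook):
    """Return (log M(C), sum_y Q(y) log(M(y)/Q(y))) for a nearest-neighbour quantizer.

    P        : list of source probabilities P(x)
    M        : list of positive masses M(y)
    rho      : rho[x][y], distortion matrix
    codebook : list of reproduction symbols (indices into M)
    """
    # phi maps each source symbol to a closest codeword
    phi = {x: min(codebook, key=lambda y: rho[x][y]) for x in range(len(P))}
    # Q is the distribution of Y = phi(X)
    Q = {y: 0.0 for y in codebook}
    for x, px in enumerate(P):
        Q[phi[x]] += px
    log_mass = math.log(sum(M[y] for y in codebook))
    lower = sum(q * math.log(M[y] / q) for y, q in Q.items() if q > 0)
    return log_mass, lower
```

Since $Y$ is a deterministic function of $X$, the lower bound equals $I(X;Y) + E[\log M(Y)] = H(Y) + E[\log M(Y)]$, which is the quantity bounded below by the rate function in the proof.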

5 Proof of Theorem 4 

The proof of the Theorem is given in 3 steps. First we assume that $\mathbb{P}$ is ergodic in blocks, and for any $D > D_{\min}^{(1)}$ we construct sets $C_n$ satisfying (i) and (iii) with $R_1(D)$ in place of $R(D)$. In the second step (still assuming $\mathbb{P}$ is ergodic in blocks), for each $D > D_{\min}$ we construct sets $C_n$ satisfying (i) and (iii). In Step 3 we drop the assumption of ergodicity in blocks, and for any $D > D_{\min}$ we construct sets $C_n$ satisfying (i) and (ii).
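5.1 Step 1:

Let $\mathbb{P}$ and $D > D_{\min}^{(1)}$ be fixed, and let an arbitrary $\epsilon > 0$ be given. By Lemma 3 we can choose a $D' \in (D_{\min}^{(1)}, D)$ such that $R_1(D') \le R_1(D) + \epsilon/8$ and a probability measure $Q^*$ on $(\hat{A},\hat{\mathcal{A}})$ such that

$$I^* + L^* \triangleq I_1(P_1,Q^*,D') + E_{Q^*}[\log M(Y)] \le R_1(D') + \epsilon/8 \le R_1(D) + \epsilon/4. \tag{22}$$

Also we can pick a $W^* \in \mathcal{M}_1(P_1,Q^*,D')$ such that

$$H(W^*\|P_1\times Q^*) \le I^* + \epsilon/4. \tag{23}$$

As in the proof of Theorem 2, for $n \ge 1$, write $Q_n^*$ for the product measure $(Q^*)^n$, and define

$$\mathcal{H}_n = \left\{y_1^n \in \hat{A}^n : \frac{1}{n}\sum_{i=1}^n \log M(y_i) \le L^* + \epsilon/4\right\}.$$

Let $Q_n$ be the measure $Q_n^*$ conditioned on $\mathcal{H}_n$, $Q_n(F) = Q_n^*(F \cap \mathcal{H}_n)/Q_n^*(\mathcal{H}_n)$, for $F \in \hat{\mathcal{A}}^n$. For each $n \ge 1$, let $\{Y(i) = (Y_1(i), Y_2(i), \ldots, Y_n(i));\ i \ge 1\}$ be IID random vectors $Y(i) \sim Q_n$, and define

$$C_n = \left\{Y(i) : 1 \le i \le e^{n(I^*+\epsilon/2)}\right\}.$$

By the definition of $\mathcal{H}_n$, any $y_1^n \in C_n$ has $M^n(y_1^n) \le e^{n(L^*+\epsilon/4)}$, so by (22)

$$\frac{1}{n}\log M^n(C_n) \le \frac{1}{n}\log\left[e^{n(I^*+\epsilon/2)}\,e^{n(L^*+\epsilon/4)}\right] = I^* + L^* + 3\epsilon/4 \le R_1(D) + \epsilon,$$

and (i) of the Theorem is satisfied with $R_1(D)$ in place of $R(D)$. Let $X_1^n$ be a random vector with distribution $P_n$, and, as in the proof of Theorem 2, let $i_n$ be the index of the first $Y(i)$ that matches $X_1^n$ within $\rho_n$-distortion $D$. To verify (iii) we will show that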

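$$i_n \le e^{n(I^*+\epsilon/2)} \quad\text{eventually, } \mathbb{P}\times\mathbb{Q}\text{-a.s.},$$

where $\mathbb{Q} = \prod_{n\ge 1}(Q_n)^{\mathbb{N}}$, and this will follow from the following two statements:

$$\limsup_{n\to\infty}\frac{1}{n}\log\left[i_n\,Q_n(B(X_1^n,D))\right] \le 0 \quad \mathbb{P}\times\mathbb{Q}\text{-a.s.} \tag{24}$$

$$\liminf_{n\to\infty}\frac{1}{n}\log Q_n(B(X_1^n,D)) \ge -(I^*+\epsilon/4) \quad \mathbb{P}\text{-a.s.} \tag{25}$$

The proof of (24) is exactly the same as the proof of (13) in the proof of Theorem 2. To prove (25), first note that by the law of large numbers $Q_n^*(\mathcal{H}_n) \to 1$, as $n \to \infty$, so (25) is equivalent to

$$\liminf_{n\to\infty}\frac{1}{n}\log Q_n^*\left(B(X_1^n,D)\cap\mathcal{H}_n\right) \ge -(I^*+\epsilon/4) \quad \mathbb{P}\text{-a.s.} \tag{26}$$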
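The random-codebook construction of this step can be simulated directly on a finite alphabet. The sketch below is illustrative only: the names `sample_conditioned` and `first_match`, the use of rejection sampling to draw from the conditioned product measure, and the cap `max_size` on the codebook size are assumptions of the sketch, not part of the proof.

```python
import random

def sample_conditioned(q, logM, n, threshold, rng):
    """Draw y_1^n IID from q, conditioned on (1/n) sum_i log M(y_i) <= threshold.

    Uses rejection sampling, so it only terminates if the conditioning
    event has positive probability under the product measure.
    """
    symbols = list(range(len(q)))
    while True:
        y = rng.choices(symbols, weights=q, k=n)
        if sum(logM[s] for s in y) / n <= threshold:
            return y

def first_match(x, q, logM, rho, D, threshold, max_size, rng):
    """Index of the first codeword within rho_n-distortion D of x, or None.

    Codewords are drawn one at a time from the conditioned measure;
    max_size plays the role of the codebook size e^{n(I* + eps/2)}.
    """
    n = len(x)
    for i in range(1, max_size + 1):
        y = sample_conditioned(q, logM, n, threshold, rng)
        if sum(rho[a][b] for a, b in zip(x, y)) / n <= D:
            return i
    return None
```

A returned index of `None` corresponds to the event that the source string is not covered by the codebook; statement (iii) of the Theorem asserts that this eventually stops happening along almost every realization.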
Let $Y_1, Y_2, \ldots$ be IID random variables with common distribution $Q^*$. For any realization $x_1^\infty$ of $\mathbb{P}$, define the random vectors $\xi_i$ and $Z_n$ by

$$\xi_i = \left(\rho(x_i,Y_i),\ \log M(Y_i)\right),\quad i \ge 1,$$

$$Z_n = \frac{1}{n}\sum_{i=1}^n \xi_i,\quad n \ge 1.$$

Also let $\Lambda_n(\lambda)$ be the log-moment generating function of $Z_n$,

$$\Lambda_n(\lambda) = \log E\left[e^{(\lambda,Z_n)}\right],\quad \lambda = (\lambda_1,\lambda_2) \in \mathbb{R}^2,$$

where $(\cdot,\cdot)$ denotes the usual inner product in $\mathbb{R}^2$. Then for $\mathbb{P}$-almost any $x_1^\infty$, by the ergodic theorem,

$$\frac{1}{n}\Lambda_n(n\lambda) = \frac{1}{n}\sum_{i=1}^n \log E_{Q^*}\left[e^{\lambda_1\rho(x_i,Y)+\lambda_2\log M(Y)}\right] \to E_{P_1}\left\{\log E_{Q^*}\left[e^{\lambda_1\rho(X,Y)+\lambda_2\log M(Y)}\right]\right\} \tag{27}$$

where $X$ and $Y$ above are independent random variables with distributions $P_1$ and $Q^*$, respectively. Next we will need the following lemma. Its proof is a simple application of the dominated convergence theorem, using the boundedness of $\rho$ and $\log M$.

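Lemma 4. For $k \ge 1$ and probability measures $\mu$ and $\nu$ on $(A^k,\mathcal{A}^k)$ and $(\hat{A}^k,\hat{\mathcal{A}}^k)$, respectively, define

$$\Lambda_{\mu,\nu}(\lambda) = \int \log\left\{\int \exp\left(\lambda_1\rho_k(x_1^k,y_1^k) + \lambda_2\frac{1}{k}\log M^k(y_1^k)\right) d\nu(y_1^k)\right\} d\mu(x_1^k),$$

for $\lambda = (\lambda_1,\lambda_2) \in \mathbb{R}^2$. Then $\Lambda_{\mu,\nu}$ is convex, finite, and differentiable for all $\lambda \in \mathbb{R}^2$.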



From Lemma 4 we have that the limiting expression in (27), which equals $\Lambda_{P_1,Q^*}$, is finite and differentiable everywhere. Therefore we can apply the Gärtner-Ellis theorem (Theorem 2.3.6 in [8]) to the sequence of random vectors $Z_n$, along $\mathbb{P}$-almost any $x_1^\infty$, to get

$$\liminf_{n\to\infty}\frac{1}{n}\log Q_n^*\left(B(x_1^n,D)\cap\mathcal{H}_n\right) = \liminf_{n\to\infty}\frac{1}{n}\log\Pr(Z_n \in F) \ge -\inf_{z\in F}\Lambda^*_{P_1,Q^*}(z) \quad\text{a.s.} \tag{28}$$

where $F = \{z = (z_1,z_2) \in \mathbb{R}^2 : z_1 \le D,\ z_2 \le L^* + \epsilon/4\}$ and

$$\Lambda^*_{P_1,Q^*}(z) = \sup_{\lambda\in\mathbb{R}^2}\left[(\lambda,z) - \Lambda_{P_1,Q^*}(\lambda)\right]$$

is the Fenchel-Legendre transform of $\Lambda_{P_1,Q^*}(\lambda)$. Recall our choice of $W^*$ in (23). Then for any bounded measurable function $\phi : \hat{A} \to \mathbb{R}$ and any fixed $x \in A$,

$$H\left(W^*(\cdot|x)\,\|\,Q^*(\cdot)\right) \ge \int \phi(y)\,dW^*(y|x) - \log\int e^{\phi(y)}\,dQ^*(y)$$

(see, e.g., Lemma 6.2.13 in [8]). Fixing $x \in A$ and $\lambda \in \mathbb{R}^2$ for a moment, take $\phi(y) = \lambda_1\rho(x,y) + \lambda_2\log M(y)$, and integrate both sides $dP_1(x)$ to get

$$H(W^*\|P_1\times Q^*) \ge \lambda_1 E_{W^*}(\rho) + \lambda_2 E_{Q^*}[\log M(Y)] - \Lambda_{P_1,Q^*}(\lambda).$$

Taking the supremum over all $\lambda \in \mathbb{R}^2$ and recalling (23) this becomes

$$I^* + \epsilon/4 \ge H(W^*\|P_1\times Q^*) \ge \Lambda^*_{P_1,Q^*}(D^*,L^*)$$

where $D^* = \int\rho\,dW^* \le D' < D$, so

$$I^* + \epsilon/4 \ge \inf_{z\in F}\Lambda^*_{P_1,Q^*}(z).$$

Combining this with the bound (28) yields (26) as required, and completes the proof of this step.
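The variational bound used at the end of this step can be checked numerically on a finite alphabet. The sketch below is a hedged illustration (the function names, the grid search over $\lambda$, and the grid parameters `span` and `steps` are assumptions): it computes $\Lambda_{P,Q}(\lambda)$ for $k = 1$, a grid lower bound on its Fenchel-Legendre transform at $(D^*, L^*)$, and the relative entropy $H(W\|P\times Q)$ that should dominate it.

```python
import math

def log_mgf(lam, P, Q, rho, logM):
    """Lambda_{P,Q}(lam) = E_P[ log E_Q exp(lam1 * rho(X,Y) + lam2 * log M(Y)) ]."""
    l1, l2 = lam
    return sum(px * math.log(sum(qy * math.exp(l1 * rho[x][y] + l2 * logM[y])
                                 for y, qy in enumerate(Q)))
               for x, px in enumerate(P))

def legendre_on_grid(z, P, Q, rho, logM, span=5.0, steps=41):
    """Grid lower bound on Lambda*(z) = sup_lam [ <lam, z> - Lambda(lam) ]."""
    grid = [-span + 2 * span * i / (steps - 1) for i in range(steps)]
    return max(l1 * z[0] + l2 * z[1] - log_mgf((l1, l2), P, Q, rho, logM)
               for l1 in grid for l2 in grid)

def relative_entropy_joint(W, P, Q):
    """H(W || P x Q) for a joint pmf W[x][y] whose marginals are P and Q."""
    return sum(w * math.log(w / (P[x] * Q[y]))
               for x, row in enumerate(W) for y, w in enumerate(row) if w > 0)

if __name__ == "__main__":
    P = [0.5, 0.5]
    W = [[0.45, 0.05], [0.05, 0.45]]      # joint pmf with marginals P and Q
    Q = [0.5, 0.5]
    rho = [[0, 1], [1, 0]]
    logM = [0.0, 0.3]
    D_star = sum(W[x][y] * rho[x][y] for x in range(2) for y in range(2))
    L_star = sum(Q[y] * logM[y] for y in range(2))
    print(legendre_on_grid((D_star, L_star), P, Q, rho, logM),
          relative_entropy_joint(W, P, Q))
```

Because the grid contains $\lambda = 0$, the computed value is nonnegative; because every grid point satisfies the variational inequality, it never exceeds the relative entropy.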

5.2 Step 2: 

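Let $\mathbb{P}$ and $D > D_{\min}$ be fixed, and an arbitrary $\epsilon > 0$ be given. By Lemma 3 we can pick $k \ge 1$ large enough so that $D_{\min}^{(k)} < D$ and $(1/k)R_k(D) \le R(D) + \epsilon/8$. This step consists of essentially repeating the argument of Step 1 along blocks of length $k$. Choose a $D' \in (D_{\min}^{(k)}, D)$ such that

$$\frac{1}{k}R_k(D') \le \frac{1}{k}R_k(D) + \epsilon/16, \tag{29}$$

and a probability measure $Q_k^*$ on $(\hat{A}^k,\hat{\mathcal{A}}^k)$ achieving

$$I_k^* + L_k^* \triangleq \frac{1}{k}I_k(P_k,Q_k^*,D') + \frac{1}{k}E_{Q_k^*}[\log M^k(Y_1^k)] \le \frac{1}{k}R_k(D') + \epsilon/16, \tag{30}$$

so that

$$I_k^* + L_k^* \le R(D) + \epsilon/4. \tag{31}$$

Also pick a $W_k^* \in \mathcal{M}_k(P_k,Q_k^*,D')$ such that

$$\frac{1}{k}H(W_k^*\|P_k\times Q_k^*) \le I_k^* + \epsilon/4. \tag{32}$$

For any $n \ge 1$ write $n = mk + r$ for integers $m \ge 0$ and $0 \le r < k$, and define

$$\mathcal{H}_{n,k} = \left\{y_1^n \in \hat{A}^n : \frac{1}{n}\sum_{i=1}^n \log M(y_i) \le L_k^* + \epsilon/4\right\}.$$

Write $Q_{n,k}^*$ for the measure

$$Q_{n,k}^* = [Q_k^*]^m \times [Q_k^*]_r,$$

where $[Q_k^*]_r$ denotes the restriction of $Q_k^*$ to $(\hat{A}^r,\hat{\mathcal{A}}^r)$, and let $Q_{n,k}$ be the measure $Q_{n,k}^*$ conditioned on $\mathcal{H}_{n,k}$. For each $n \ge 1$, let $\{Y(i) = (Y_1(i), Y_2(i), \ldots, Y_n(i));\ i \ge 1\}$ be IID random vectors $Y(i) \sim Q_{n,k}$, and let $C_n$ consist of the first $e^{n(I_k^*+\epsilon/2)}$ of them. As before, by the definitions of $\mathcal{H}_{n,k}$ and $C_n$, and using (31), it easily follows that

$$\frac{1}{n}\log M^n(C_n) \le R(D) + \epsilon$$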

so (i) of the Theorem is satisfied. Let $Y_1, Y_2, \ldots, Y_n$ be distributed according to $Q_{n,k}^*$, and note that the random vectors $\tilde{Y}_i^{(k)} = Y_{ik+1}^{(i+1)k}$ are IID with distribution $Q_k^*$ (for $i = 0, 1, \ldots, m-1$). Therefore, as $n \to \infty$, by the law of large numbers we have that with probability 1:

$$\frac{1}{n}\sum_{i=1}^n \log M(Y_i) \le \left(\frac{m}{n}\right)\frac{1}{m}\sum_{i=0}^{m-1}\log M^k(\tilde{Y}_i^{(k)}) + \frac{rL_{\max}}{n} \to L_k^*. \tag{33}$$

Following the same steps as before, to verify (iii) it suffices to show that

$$\liminf_{n\to\infty}\frac{1}{n}\log Q_{n,k}(B(X_1^n,D)) \ge -(I_k^*+\epsilon/4) \quad \mathbb{P}\text{-a.s.}$$

and, in view of (33), this reduces to

$$\liminf_{n\to\infty}\frac{1}{n}\log Q_{n,k}^*\left(B(X_1^n,D)\cap\mathcal{H}_{n,k}\right) \ge -(I_k^*+\epsilon/4) \quad \mathbb{P}\text{-a.s.} \tag{34}$$

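For an arbitrary realization $x_1^\infty$ from $\mathbb{P}$ and with $Y_1^n$ as above, consider blocks of length $k$. For $i = 0, 1, \ldots, m-1$, we write

$$\tilde{Y}_i^{(k)} = Y_{ik+1}^{(i+1)k} \quad\text{and}\quad \tilde{x}_i^{(k)} = x_{ik+1}^{(i+1)k},$$

so that the probability $Q_{n,k}^*\left(B(x_1^n,D)\cap\mathcal{H}_{n,k}\right)$ can be written as

$$\Pr\left\{\left(\frac{mk}{n}\right)\frac{1}{m}\sum_{i=0}^{m-1}\rho_k(\tilde{x}_i^{(k)},\tilde{Y}_i^{(k)}) + \frac{1}{n}\sum_{i=mk+1}^{n}\rho(x_i,Y_i) \le D \ \text{ and }\ \left(\frac{m}{n}\right)\frac{1}{m}\sum_{i=0}^{m-1}\log M^k(\tilde{Y}_i^{(k)}) + \frac{1}{n}\log M^r(Y_{mk+1}^n) \le L_k^* + \epsilon/4\right\}.$$

Since we assume $\rho(x,y) \le \rho_{\max}$ and $|\log M(y)| \le L_{\max}$ for all $x \in A$, $y \in \hat{A}$, then for all $n$ large enough (uniformly in $x_1^\infty$) the above probability is bounded below by

$$\Pr\left\{\frac{1}{m}\sum_{i=0}^{m-1}\rho_k(\tilde{x}_i^{(k)},\tilde{Y}_i^{(k)}) \le D' + \epsilon/8 \ \text{ and }\ \frac{1}{m}\sum_{i=0}^{m-1}\frac{1}{k}\log M^k(\tilde{Y}_i^{(k)}) \le L_k^* + \epsilon/8\right\}.$$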

Now we are in the same situation as in the previous step, with the IID random variables $\tilde{Y}_i^{(k)}$ in place of the $Y_i$, the ergodic process $\{\tilde{X}_i^{(k)}\}$ in place of $\{X_i\}$, and $D' + \epsilon/8$ in place of $D$. Repeating the same argument as in Step 1 and invoking Lemma 4 and the Gärtner-Ellis theorem,

$$\liminf_{n\to\infty}\frac{1}{n}\log Q_{n,k}^*\left(B(X_1^n,D)\cap\mathcal{H}_{n,k}\right) \ge -\inf_{z_1 < D'+\epsilon/8,\ z_2 < L_k^*+\epsilon/8}\Lambda_k^*(z_1,z_2) \quad \mathbb{P}\text{-a.s.} \tag{35}$$

where, in the notation of Lemma 4, $\Lambda_k^*(z)$ is the Fenchel-Legendre transform of $\Lambda_{P_k,Q_k^*}(\lambda)$. Recall our choice of $W_k^*$ in (32) and write $D^* = \int\rho_k\,dW_k^* \le D'$. Then by an application of Lemma 6.2.13 from [8] together with (32) we get that

$$I_k^* + \epsilon/4 \ge \frac{1}{k}H(W_k^*\|P_k\times Q_k^*) \ge \Lambda_k^*(D^*,L_k^*),$$

and this together with (35) proves (34), concluding this step.
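The blocking identity behind this step, namely that $\rho_n$ splits into a weighted average of $k$-block distortions plus a remainder of at most $r$ symbols, can be checked with a few lines of code. The helper names `rho_n` and `blocked_distortion` are hypothetical and introduced only for this sketch.

```python
def rho_n(x, y, rho):
    """Single-letter distortion (1/n) * sum_i rho(x_i, y_i)."""
    return sum(rho[a][b] for a, b in zip(x, y)) / len(x)

def blocked_distortion(x, y, rho, k):
    """Split rho_n(x, y) as (mk/n) * (average of k-block distortions) + tail.

    Writes n = m*k + r with 0 <= r < k; the tail collects the last r symbols.
    Returns the pair (main, tail); their sum equals rho_n(x, y).
    """
    n = len(x)
    m, r = divmod(n, k)
    block_avgs = [rho_n(x[i * k:(i + 1) * k], y[i * k:(i + 1) * k], rho)
                  for i in range(m)]
    main = (m * k / n) * (sum(block_avgs) / m) if m else 0.0
    tail = sum(rho[a][b] for a, b in zip(x[m * k:], y[m * k:])) / n
    return main, tail
```

Since the tail involves fewer than $k$ symbols, it is at most $k\rho_{\max}/n$, which is why the leftover coordinates do not affect the exponents for large $n$.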

5.3 Step 3: 

In this part we invoke the ergodic decomposition theorem to remove the assumption that $\mathbb{P}$ is ergodic in blocks. Although somewhat more delicate, the following argument is very similar to Berger's proof of the abstract coding theorem; see pp. 278-281 in [2].

As in Step 2, let $\mathbb{P}$ and $D > D_{\min}$ be fixed, and let an $\epsilon > 0$ be given. Pick $k \ge 1$ large enough so that $D_{\min}^{(k)} < D$ and $\frac{1}{k}R_k(D) \le R(D) + \epsilon/8$, and pick $D' \in (D_{\min}^{(k)}, D)$ such that (29) holds. Also choose $Q_k^*$ and $W_k^*$ as in Step 2 so that (30), (31) and (32) all hold.

Let $\Omega = (A^k)^{\mathbb{N}}$, $\mathcal{F} = (\mathcal{A}^k)^{\mathbb{N}}$, and note that there is a natural 1-1 correspondence between sets $F \in \mathcal{A}^{\mathbb{N}}$ and sets $\tilde{F} \in (\mathcal{A}^k)^{\mathbb{N}}$: Writing $\tilde{x}_i = x_{ik+1}^{(i+1)k}$,

$$\tilde{F} = \{\tilde{x}_0^\infty : x_1^\infty \in F\}. \tag{36}$$

Let $\mu$ be the stationary measure on $(\Omega,\mathcal{F})$ describing the distribution of the "blocked" process $\{\tilde{X}_i = X_{ik+1}^{(i+1)k};\ i \ge 0\}$, where, since $k$ is fixed throughout the rest of the proof, we have dropped the superscript in $\tilde{X}_i^{(k)}$. Although $\mu$ may not be ergodic, from the ergodic decomposition theorem we get the following information (see pp. 278-279 in [2]).

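Lemma 5. There is an integer $k'$ dividing $k$, and probability measures $\mu_i$, $i = 0, 1, \ldots, k'-1$ on $(\Omega,\mathcal{F})$ with the following properties:

(i) $\mu = (1/k')\sum_{i=0}^{k'-1}\mu_i$.

(ii) Each $\mu_i$ is stationary and ergodic.

(iii) For each $i$, let $\mathbb{P}^{(i)}$ denote the measure on $(A^{\mathbb{N}},\mathcal{A}^{\mathbb{N}})$ induced by $\mu_i$:

$$\mathbb{P}^{(i)}(F) = \mu_i(\tilde{F}),\quad F \in \mathcal{A}^{\mathbb{N}}$$

[recall the notation of (36)]. Then $\mathbb{P} = (1/k')\sum_{i=0}^{k'-1}\mathbb{P}^{(i)}$, and each $\mathbb{P}^{(i)}$ is stationary in $k'$-blocks and ergodic in $k'$-blocks.

(iv) For each $0 \le i < k'$ and $j \ge 0$, the distribution that $\mathbb{P}^{(i)}$ induces on the process $\{X_{j+n};\ n \ge 1\}$ is $\mathbb{P}^{(i+j \bmod k')}$.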

For each $i = 0, 1, \ldots, k'-1$, let $\mu_{i,1}$ denote the first-order marginal of $\mu_i$ and write $R(D|i) = R_1(D;\mu_{i,1},M^k)$ for the first-order rate function of the measure $\mu_i$, with respect to the distortion measure $\rho_k$, and with mass function $M^k$. Since $W_k^*$ chosen as above has its $A^k$-marginal equal to $P_k$ we can write it as $W_k^* = V_k^* \circ P_k$ where $V_k^*(\cdot|x_1^k)$ denote the regular conditional probability distributions. Write $P_k^{(i)}$ for the $k$-dimensional marginals of the measures $\mathbb{P}^{(i)}$, and define probability measures $W_k^{(i)}$ on $(A^k\times\hat{A}^k,\ \mathcal{A}^k\times\hat{\mathcal{A}}^k)$ by $W_k^{(i)} = V_k^* \circ P_k^{(i)}$. Let $D_i = \int\rho_k\,dW_k^{(i)}$ so that by Lemma 5 (iii),

$$\frac{1}{k'}\sum_{i=0}^{k'-1} D_i = \int\rho_k\,dW_k^* \le D'. \tag{37}$$

Similarly, writing $Q_k^{(i)}$ for the $\hat{A}^k$-marginal of $W_k^{(i)}$ and applying Lemma 5 (iii),

$$\frac{1}{k'}\sum_{i=0}^{k'-1}\int\log M^k(y_1^k)\,dQ_k^{(i)}(y_1^k) = \int\log M^k(y_1^k)\,dQ_k^*(y_1^k) \tag{38}$$

and using the concavity (convexity $\cap$) of mutual information from Lemma 3 (iv),

$$\frac{1}{k'}\sum_{i=0}^{k'-1} H\left(W_k^{(i)}\,\|\,P_k^{(i)}\times Q_k^{(i)}\right) \le H(W_k^*\|P_k\times Q_k^*). \tag{39}$$

For $N \ge 1$ large enough we can use the result of Step 1 to get $N$-dimensional sets $B_i$ that almost-cover $(A^k)^N$ with respect to $\mu_i$. Specifically, consider $N$ large enough so that

$$\frac{\max\{\rho_{\max}, L_{\max}, 1\}}{kN} \le \min\{\epsilon/8,\ (D-D')/2\}. \tag{40}$$

For any such $N$, by the result of Step 1 we can choose sets $B_i \subset (\hat{A}^k)^N$ such that, for each $i$,

$$\mu_i\left([B_i]_{D_i}\right) \ge 1 - \epsilon_N, \quad\text{where } \epsilon_N \to 0 \text{ as } N \to \infty, \text{ and} \tag{41}$$

$$M^{kN}(B_i) \le \exp\{N(R(D_i|i) + \epsilon/8)\}. \tag{42}$$

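Now choose and fix an arbitrary $y^* \in \hat{A}$, and for $n = k'(Nk+1)$ define new sets $B_i^* \subset \hat{A}^n$ by

$$B_i^* = \prod_{j=0}^{k'-1}\left[B_{i+j \bmod k'}\times\{y^*\}\right],$$

where $\prod$ denotes the cartesian product. Then, by (40), for any $x_1^n$,

$$\rho_n(x_1^n, B_i^*) \le \frac{D-D'}{2} + \frac{1}{k'}\sum_{j=0}^{k'-1}\rho_{kN}\left(x_{j(kN+1)+1}^{j(kN+1)+kN},\ B_{i+j \bmod k'}\right),$$

so by a simple union bound,

$$\begin{aligned}
\mathbb{P}^{(i)}\left([B_i^*]_D\right) &\overset{(a)}{\ge} 1 - \sum_{j=0}^{k'-1}\left[1 - \mathbb{P}^{(i+j \bmod k')}\left([B_{i+j \bmod k'}]_{D_{i+j \bmod k'}}\right)\right] \\
&\overset{(b)}{=} 1 - \sum_{j=0}^{k'-1}\left[1 - \mu_j\left([B_j]_{D_j}\right)\right] \\
&\overset{(c)}{\ge} 1 - k'\epsilon_N,
\end{aligned} \tag{43}$$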
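where we used (37) in (a), Lemma 5 (iv) in (b), and (41) in (c). Also, using the definition of $B_i^*$ and the bounds (40) and (42),

$$\frac{1}{n}\log M^n(B_i^*) = \frac{\log M(y^*)}{kN+1} + \frac{1}{k'(kN+1)}\sum_{j=0}^{k'-1}\log M^{kN}\left(B_{i+j \bmod k'}\right) \le \epsilon/8 + \frac{1}{k'}\sum_{j=0}^{k'-1}\frac{1}{k}\left(R(D_j|j) + \epsilon/8\right),$$

but from the definition of $R(D|j)$ and (39) and (38) this is

$$\begin{aligned}
&\le \frac{1}{k'}\sum_{j=0}^{k'-1}\frac{1}{k}\left\{H\left(W_k^{(j)}\,\|\,P_k^{(j)}\times Q_k^{(j)}\right) + E_{Q_k^{(j)}}[\log M^k(Y_1^k)]\right\} + \epsilon/4 \\
&\le I_k^* + L_k^* + \epsilon/2 \\
&\le R(D) + 3\epsilon/4,
\end{aligned} \tag{44}$$

where the last inequality follows from (31). So in (43) and (44) we have shown that, for all $i = 0, 1, \ldots, k'-1$,

$$\mathbb{P}^{(i)}\left([B_i^*]_D\right) \ge 1 - k'\epsilon_N \quad\text{and} \tag{45}$$

$$\frac{1}{n}\log M^n(B_i^*) \le R(D) + 3\epsilon/4. \tag{46}$$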


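Finally we define sets $C_n \subset \hat{A}^n$ by

$$C_n = \bigcup_{i=0}^{k'-1} B_i^*.$$

From the last two bounds above and (40), the sets $C_n$ have

$$\frac{1}{n}\log M^n(C_n) \le \frac{\log k'}{n} + R(D) + 3\epsilon/4 \le R(D) + \epsilon,$$

and by Lemma 5 (iii),

$$P_n\left([C_n]_D\right) = \frac{1}{k'}\sum_{i=0}^{k'-1}\mathbb{P}^{(i)}\left([C_n]_D\right) \ge \frac{1}{k'}\sum_{i=0}^{k'-1}\mathbb{P}^{(i)}\left([B_i^*]_D\right) \ge 1 - \epsilon'_n,$$

where $\epsilon'_n = k'\epsilon_N$ when $n = k'(Nk+1)$.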

In short, we have shown that for any $D > D_{\min}$ and any $\epsilon > 0$, there exist (fixed) integers $k$, $k'$ and $N_0$ such that:

There is a sequence of sets $C_n$, for $n = k'(Nk+1)$, $N \ge N_0$, satisfying:
$(1/n)\log M^n(C_n) \le R(D) + \epsilon$ for all $n$, and
$P_n([C_n]_D) \to 1$ as $n \to \infty$.    (+)

Since this is essentially an asymptotic result, the restrictions that $N \ge N_0$ and $n$ be of the form $n = k'(Nk+1)$ are inessential. Therefore they can be easily dropped to give (+) for all $n \ge 1$, that is, to produce a sequence of sets $\{C_n;\ n \ge 1\}$ satisfying (i) and (ii) of Theorem 4. □
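The product construction used in Step 3, in which block codebooks are concatenated with a fixed padding symbol $y^*$ after each block, can be sketched on toy data as follows. The function names and the toy codebooks are assumptions of the sketch; `block_bound` evaluates the right-hand side $\rho_{\max}/(L+1) + (1/k')\sum_j \rho_L(\cdot, B_j)$ for blocks of common length $L$, which plays the role of the bound derived from (40).

```python
import itertools

def padded_product(blocks, y_star):
    """Cartesian product of codebooks, each block codeword followed by y_star."""
    out = []
    for combo in itertools.product(*blocks):
        word = []
        for piece in combo:
            word.extend(piece)
            word.append(y_star)
        out.append(tuple(word))
    return out

def set_distortion(x, codebook, rho):
    """Distortion between x and the closest codeword of the codebook."""
    return min(sum(rho[a][b] for a, b in zip(x, y)) / len(x) for y in codebook)

def block_bound(x, blocks, rho, rho_max):
    """rho_max/(L+1) plus the average blockwise distortion to each codebook."""
    L = len(blocks[0][0])              # common block length
    kp = len(blocks)                   # number of blocks (k' in the text)
    avg = sum(set_distortion(x[j * (L + 1): j * (L + 1) + L], blocks[j], rho)
              for j in range(kp)) / kp
    return rho_max / (L + 1) + avg
```

For instance, with `blocks = [[(0, 0), (1, 1)], [(0, 1)]]`, padding symbol `0`, Hamming distortion and `x = (0, 1, 1, 1, 1, 0)`, the distortion to the product codebook is 1/2, below the bound 1/3 + 1/2.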
Appendix

Proof of Remark 1: In view of part (i) of Theorem 2 and the remark that $P^n([C_n]_D) \to 1$, for every $m \ge 1$ we can choose a sequence of sets $\{C_n^{(m)};\ n \ge 1\}$ such that

$$\frac{1}{n}\log M^n(C_n^{(m)}) \le R(D) + \frac{1}{m}, \quad\text{for all } m, n \ge 1, \text{ and}$$

$$P^n\left([C_n^{(m)}]_D\right) \ge 1 - \frac{1}{m}, \quad\text{for all } m \ge 1,\ n \ge N(m),$$

where $N(m)$ is some fixed sequence of integers, strictly increasing to $\infty$ as $m \to \infty$. So for each $n \ge 1$ there is a unique $m = m(n)$ such that $N(m) \le n < N(m+1)$. Since $\{N(m)\}$ is strictly increasing, the sequence $\{m(n)\}$ is nondecreasing and $m(n) \to \infty$ as $n \to \infty$. Define $C_n^* = C_n^{(m(n))}$ for all $n \ge 1$. From the last two bounds,

$$\frac{1}{n}\log M^n(C_n^*) \le R(D) + \frac{1}{m(n)}, \quad\text{for all } n \ge 1, \text{ and}$$

$$P^n\left([C_n^*]_D\right) \ge 1 - \frac{1}{m(n)} \quad\text{for all } n \ge N(m(n)).$$

But since $n$ is always $n \ge N(m(n))$ by definition, and $m(n) \to \infty$ as $n \to \infty$, this proves (i') and (ii'). Also, since $\rho$ is bounded, (iii') immediately follows from (ii'). □
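The diagonal index $m(n)$ in the proof above is easy to compute once the thresholds $N(m)$ are given. In the sketch below the thresholds are passed as a hypothetical finite list `N = [N(1), N(2), ...]` (an assumption made for illustration), and $m(n)$ is found by binary search.

```python
import bisect

def diagonal_index(n, N):
    """Return m(n), the unique m >= 1 with N(m) <= n < N(m+1).

    N is a strictly increasing list [N(1), N(2), ...]; the result is only
    meaningful for N(1) <= n < (the next threshold after the last listed one).
    """
    # bisect_right counts the thresholds that are <= n
    return bisect.bisect_right(N, n)
```

By construction `diagonal_index` is nondecreasing in `n` and satisfies `N[m - 1] <= n` for `m = diagonal_index(n, N)`, which is the property $n \ge N(m(n))$ used to conclude the proof.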

Proof outline of Lemma 1: For part (i) it suffices to consider the case $I(P,Q,D) < \infty$, so we may assume that the set $\mathcal{M}(P,Q,D)$ is nonempty. Since the marginals of any $W \in \mathcal{M}(P,Q,D)$ are $P$ and $Q$, $W$ is absolutely continuous with respect to $P\times Q$, so $H(W\|P\times Q)$ is continuous over $W \in \mathcal{M}(P,Q,D)$. Since the sets $\mathcal{M}(P,Q,D)$ are compact (in the Euclidean topology), the infimum in (8) must be achieved. A similar argument works for $R(D)$: Combining the two infima in its definition,

$$R(D) = \inf_{W\in\mathcal{M}(P,D)}\left\{H(W\|W_X\times W_Y) + E_{W_Y}[\log M(Y)]\right\}, \tag{47}$$

where $\mathcal{M}(P,D) = \cup_Q\,\mathcal{M}(P,Q,D)$. Since the sets $\mathcal{M}(P,D)$ are compact, the infimum in (47) is achieved by some $W^* \in \mathcal{M}(P,D)$, and $Q^* = W^*_Y$ achieves the infimum in the definition of $R(D)$.

For part (ii) recall the assumption that for all $a \in A$ there is $b = b(a)$ such that $\rho(a,b) = 0$. If we let $W(a,b) = P(a)\,\mathbb{I}_{\{b=b(a)\}}$, then $W \in \mathcal{M}(P,D)$ for any $D \ge 0$ and from (47), $R(D) \le H(W\|W_X\times W_Y) + E_{W_Y}[\log M(Y)] < \infty$ for all $D \ge 0$. Since the sets $\mathcal{M}(P,D)$ are increasing in $D$, $R(D)$ is nonincreasing. To see that it is convex, let $W \in \mathcal{M}(P,D_1)$ and $W' \in \mathcal{M}(P,D_2)$ be arbitrary. Given $\lambda \in [0,1]$ let $\lambda' = 1 - \lambda$, and write $V = \lambda W + \lambda' W'$. Then $V \in \mathcal{M}(P,\lambda D_1 + \lambda' D_2)$ and the $Y$-marginal of $V$, $V_Y$, is $\lambda W_Y + \lambda' W'_Y$. Recalling (47) and that relative entropy is jointly convex in its two arguments,

$$\begin{aligned}
R(\lambda D_1 + \lambda' D_2) &\le H(V\|V_X\times V_Y) + E_{V_Y}[\log M(Y)] \\
&\le \lambda\left\{H(W\|W_X\times W_Y) + E_{W_Y}[\log M(Y)]\right\} + \lambda'\left\{H(W'\|W'_X\times W'_Y) + E_{W'_Y}[\log M(Y)]\right\}.
\end{aligned}$$

Taking the infimum over all $W \in \mathcal{M}(P,D_1)$, $W' \in \mathcal{M}(P,D_2)$, and using (47) shows that $R(D)$ is convex, and since it is finite for all $D \ge 0$ it is also continuous.
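The joint convexity of relative entropy invoked in the convexity argument can be spot-checked numerically for finite distributions. The sketch below is an illustration only (the function names are hypothetical); it evaluates the gap $\lambda H(p_1\|q_1) + \lambda' H(p_2\|q_2) - H(\lambda p_1 + \lambda' p_2\,\|\,\lambda q_1 + \lambda' q_2)$, which should be nonnegative.

```python
import math

def rel_entropy(p, q):
    """Relative entropy H(p || q) in nats, assuming q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def mix(p, q, lam):
    """Convex combination lam * p + (1 - lam) * q of two pmfs."""
    return [lam * a + (1 - lam) * b for a, b in zip(p, q)]

def convexity_gap(p1, q1, p2, q2, lam):
    """lam*H(p1||q1) + (1-lam)*H(p2||q2) - H(mixture of p's || mixture of q's)."""
    lhs = rel_entropy(mix(p1, p2, lam), mix(q1, q2, lam))
    rhs = lam * rel_entropy(p1, q1) + (1 - lam) * rel_entropy(p2, q2)
    return rhs - lhs
```

A nonnegative gap for every $\lambda \in [0,1]$ is exactly the inequality used to pass from $V = \lambda W + \lambda' W'$ to the convex combination of the two bracketed terms.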

The proof of (iii) is essentially identical to that of (ii), using the definition (8) in place of (47). The only difference is that $I(P,Q,D)$ can be infinite, so its convexity (and the fact that it is nonincreasing) imply that it is continuous for $D \ge 0$ except possibly at $D = \inf\{D \ge 0 : I(P,Q,D) < \infty\}$.

Part (iv) is a well-known information theoretic fact; see, e.g., Lemma 9.4.2 in [9].

For part (v) let $W^*$ achieve the infimum in (47). Since relative entropy is nonnegative we always have $R(D) \ge R_{\min}$, with equality if and only if $W^*_Y$ is supported on the set $A' = \{y \in A : \log M(y) = R_{\min}\}$ and $W^* = W^*_X\times W^*_Y$. Clearly, these two conditions are satisfied if and only if

$$D \ge \min\{E_{P\times Q}[\rho(X,Y)] : Q \text{ supported on } A'\},$$

but the right hand side above is exactly equal to $D_{\max}$. □

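Proof of Lemma 2: If $\gamma_i = 0$ then, as discussed in the proof of Theorem 2, $F_i(\epsilon_1) = \{\delta_{a_i}\}$ for all $\epsilon_1$ and

$$H(\delta_{a_i}\|Q^*) = -\log Q^*(a_i) \le -\log P(a_i) < \infty.$$

If $\gamma_i > 0$ then there must exist a $b^* \in A$, $b^* \ne a_i$, such that $W^*(b^*|a_i) > 0$. Write $d_{\max}$ for the maximum of $\sum_b W^*(b|a_j)\rho(a_j,b)$ over all $j = 1, \ldots, m$, and let $\rho_{\min} = \min\{\rho(a,b) : a \ne b\}$. For $\alpha \in (0,1)$, let

$$Q_i(b) = \begin{cases} W^*(a_i|a_i) + \alpha & \text{if } b = a_i \\ W^*(b^*|a_i) - \alpha & \text{if } b = b^* \\ W^*(b|a_i) & \text{otherwise.} \end{cases}$$

Then, for $\epsilon_1$ small enough to make $(\delta - \epsilon_1)\rho_{\min} > \epsilon_1 d_{\max}(1+\epsilon_1)$, it is an elementary calculation to verify that $Q_i \in F_i(\epsilon_1)$ and $H(Q_i\|Q^*) < \infty$, as long as $\alpha$ satisfies the following conditions:

$$\alpha < 1 - W^*(a_i|a_i)$$

$$\alpha < W^*(b^*|a_i)$$

$$\frac{\epsilon_1 d_{\max}}{\rho_{\min}} < \alpha < \frac{\delta - \epsilon_1}{1 + \epsilon_1}.$$

Taking $W(a_i,b) = Q_i(b)P(a_i)$ we also have $W \in F(\epsilon_1)$. □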

Proof of Lemma 3: Since the sets $\mathcal{M}_n(P_n,Q_n,D)$ are increasing in $D$, $R_n(D)$ is nonincreasing in $D$. Next we claim that relative entropy is jointly convex in its two arguments. Let $\mu$, $\nu$ be two probability measures over a Polish space $(S,\mathcal{S})$. In the case when $\mu$ and $\nu$ both consist of only a finite number of atoms, the joint convexity of $H(\mu\|\nu)$ is well-known (see, e.g., Theorem 2.7.2 in [6]). In general, $H(\mu\|\nu)$ can be written as

$$H(\mu\|\nu) = \sup_{\{E_i\}}\sum_i \mu(E_i)\log\frac{\mu(E_i)}{\nu(E_i)}$$

where the supremum is over all finite measurable partitions of $S$ (see Theorem 2.4.1 in [14]). Therefore $H(\mu\|\nu)$ is the pointwise supremum of convex functions, hence itself convex. As in (47), combining the two infima, $R_n(D)$ can equivalently be written as

$$R_n(D) = \inf_{W_n\in\mathcal{M}_n(P_n,D)}\left\{H(W_n\|W_{n,X}\times W_{n,Y}) + E_{W_{n,Y}}[\log M^n(Y_1^n)]\right\} \tag{48}$$

where $\mathcal{M}_n(P_n,D) = \cup_{Q_n}\mathcal{M}_n(P_n,Q_n,D)$. Using this together with the joint convexity of relative entropy as in the proof of Lemma 1 (ii) shows that $R_n(D)$ is convex. Since it is also nonincreasing and bounded away from $-\infty$, $R_n(D)$ is also continuous at all $D$ except possibly at the point

$$D_{\min}^{(n)} = \inf\{D \ge 0 : R_n(D) < +\infty\}.$$

This proves (i). For (ii) notice that if $R(D)$ exists then it must also be nonincreasing and convex in $D \ge 0$ since $R_n(D)$ is; therefore, it must also be continuous except possibly at $D_{\min}$.

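For part (iii), let $m, n \ge 1$ be arbitrary, and let $W_m \in \mathcal{M}_m(P_m,D)$ and $W_n \in \mathcal{M}_n(P_n,D)$. Define a probability measure $W_{m+n}$ on $(A^{m+n}\times\hat{A}^{m+n},\ \mathcal{A}^{m+n}\times\hat{\mathcal{A}}^{m+n})$ by

$$W_{m+n}(dx_1^{m+n},dy_1^{m+n}) = W_m(dy_1^m|x_1^m)\,W_n(dy_{m+1}^{m+n}|x_{m+1}^{m+n})\,P_{m+n}(dx_1^{m+n}).$$

Notice that $W_{m+n} \in \mathcal{M}_{m+n}(P_{m+n},D)$, and that, if $(X_1^{m+n},Y_1^{m+n})$ are random vectors distributed according to $W_{m+n}$, then $Y_1^m$ and $Y_{m+1}^{m+n}$ are conditionally independent given $X_1^{m+n}$. Therefore,

$$\begin{aligned}
R_{m+n}(D) &\overset{(a)}{\le} H(W_{m+n}\|W_{m+n,X}\times W_{m+n,Y}) + E_{W_{m+n,Y}}[\log M^{m+n}(Y_1^{m+n})] \\
&= I(X_1^{m+n};Y_1^{m+n}) + E_{W_{m+n,Y}}[\log M^{m+n}(Y_1^{m+n})] \\
&\overset{(b)}{\le} I(X_1^m;Y_1^m) + I(X_{m+1}^{m+n};Y_{m+1}^{m+n}) + E_{W_{m,Y}}[\log M^m(Y_1^m)] + E_{W_{n,Y}}[\log M^n(Y_1^n)]
\end{aligned}$$

where (a) follows from (48) and (b) follows from the conditional independence of $Y_1^m$ and $Y_{m+1}^{m+n}$ given $X_1^{m+n}$ (see, e.g., Lemma 9.4.2 in [9]). So we have shown that $R_{m+n}(D)$ is bounded above by

$$H(W_m\|W_{m,X}\times W_{m,Y}) + E_{W_{m,Y}}[\log M^m(Y_1^m)] + H(W_n\|W_{n,X}\times W_{n,Y}) + E_{W_{n,Y}}[\log M^n(Y_1^n)],$$

and taking the infimum over all $W_m \in \mathcal{M}_m(P_m,D)$ and $W_n \in \mathcal{M}_n(P_n,D)$ yields

$$R_{m+n}(D) \le R_m(D) + R_n(D). \tag{49}$$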


[Note that in the above argument we implicitly assumed that we could find $W_m \in \mathcal{M}_m(P_m,D)$ and $W_n \in \mathcal{M}_n(P_n,D)$; if this was not the case, then either $R_m(D)$ or $R_n(D)$ would be equal to $+\infty$, and (49) would still trivially hold.] Therefore the sequence $\{R_n(D)\}$ is subadditive and it follows that $\lim_n (1/n)R_n(D) = \inf_n (1/n)R_n(D)$. From this it is immediate that $D_{\min} = \inf_n D_{\min}^{(n)}$.

Part (iv) is a well-known information theoretic fact; see, e.g., Problem 7.4 in [^. D 
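The subadditivity step in part (iii) is an instance of Fekete's lemma: for a subadditive sequence $a_n$, the ratios $a_n/n$ converge to $\inf_n a_n/n$. The sketch below illustrates this on the hypothetical sequence $a_n = cn + \sqrt{n}$, chosen only because its subadditivity is easy to see; it is not a rate function from the text.

```python
import math

def a(n, c=0.7):
    """Toy subadditive sequence a_n = c*n + sqrt(n); a_n / n decreases to c."""
    return c * n + math.sqrt(n)

def is_subadditive(seq, upto):
    """Check seq(m + n) <= seq(m) + seq(n) for all 1 <= m, n < upto."""
    return all(seq(m + n) <= seq(m) + seq(n) + 1e-12
               for m in range(1, upto) for n in range(1, upto))

def running_inf(seq, upto):
    """inf over 1 <= n <= upto of seq(n) / n."""
    return min(seq(n) / n for n in range(1, upto + 1))
```

Here `a(n) / n` equals `c + 1 / sqrt(n)`, so the running infimum is attained at the largest index examined and approaches `c`, mirroring the identity $\lim_n (1/n)R_n(D) = \inf_n (1/n)R_n(D)$.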

References 

[1] R.R. Bahadur. Some Limit Theorems in Statistics. SIAM, Philadelphia, PA, 1971.

[2] T. Berger. Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall Inc., Englewood Cliffs, NJ, 1971.

[3] R.E. Blahut. Hypothesis testing and information theory. IEEE Trans. Inform. Theory, 20(4):405-417, 1974.

[4] J.A. Bucklew. The source coding theorem via Sanov's theorem. IEEE Trans. Inform. Theory, 33(6):907-909, 1987.

[5] J.A. Bucklew. A large deviation theory proof of the abstract alphabet source coding theorem. IEEE Trans. Inform. Theory, 34(5):1081-1083, 1988.

[6] T.M. Cover and J.A. Thomas. Elements of Information Theory. J. Wiley, New York, 1991.

[7] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, New York, 1981.

[8] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Second Edition. Springer-Verlag, New York, 1998.

[9] R.M. Gray. Entropy and Information Theory. Springer-Verlag, New York, 1990.

[10] L.H. Harper. Optimal numberings and isoperimetric problems on graphs. J. Combinatorial Theory, 1:385-393, 1966.

[11] C. McDiarmid. On the method of bounded differences. In Surveys in combinatorics (Norwich, 1989), pages 148-188. London Math. Soc. Lecture Note Ser., 141, Cambridge Univ. Press, Cambridge, 1989.

[12] C. McDiarmid. Concentration. In Probabilistic methods for algorithmic discrete mathematics, pages 195-248. Algorithms Combin., 16, Springer, Berlin, 1998.




[13] K. Petersen. Ergodic Theory. Cambridge University Press, Cambridge, 1983.

[14] M.S. Pinsker. Information and Information Stability of Random Variables and Processes. Holden-Day, San Francisco, 1964.

[15] C.E. Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec., part 4:142-163, 1959. Reprinted in D. Slepian (ed.), Key Papers in the Development of Information Theory, IEEE Press, 1974.

[16] V. Strassen. Asymptotische Abschätzungen in Shannons Informationstheorie. In Trans. Third Prague Conf. Information Theory, Statist. Decision Functions, Random Processes (Liblice, 1962), pages 689-723. Publ. House Czech. Acad. Sci., Prague, 1964.

[17] M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Inst. Hautes Études Sci. Publ. Math., No. 81:73-205, 1995.




